Skip to content

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954

Open
Claudius-Maginificent wants to merge 1 commit into
v4.0-devfrom
feat/platform-wallet-shutdown-join
Open

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
Claudius-Maginificent wants to merge 1 commit into
v4.0-devfrom
feat/platform-wallet-shutdown-join

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Why this PR exists

What was done

  • rs-dash-async: ThreadRegistry, atomic flag guards, ShutdownReport / WorkerStatus.
  • rs-platform-wallet: CoordinatorLifecycle, registry-driven event-adapter task, shutdown() with weighted join + detached-orphan reporting; ShieldedShutdownIncomplete error + FFI code 19.

Testing

cargo fmt --all --check; cargo clippy --workspace --all-targets -- -D warnings; cargo test -p platform-wallet -p dash-async (dash-async 38; platform-wallet 322 with --all-features; 47 shutdown/registry tests incl. shutdown_joins_all_workers, shutdown_waits_for_in_flight_pass_to_drain, restart_in_place). The 4 Swift files need a separate iOS build check.

Breaking changes

! — adds an FFI error code (19) and a shutdown error variant.

Checklist

  • Self-review performed
  • Tests added/updated

🤖 Co-authored by Claudius the Magnificent AI Agent

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR adds shared worker lifecycle primitives, refactors wallet sync managers to use them, and propagates incomplete-shutdown status through wallet, FFI, and Swift result handling.

Changes

Coordinator Shutdown and Lifecycle Refactor

Layer / File(s) Summary
Atomic guard and registry primitives
packages/rs-dash-async/src/atomic.rs, packages/rs-dash-async/src/lib.rs, packages/rs-dash-async/src/registry.rs, packages/rs-dash-async/Cargo.toml
AtomicFlagGuard is added as a drop-based atomic flag wrapper, and ThreadRegistry is added with worker status, shutdown reporting, worker config, lifecycle APIs, and tests. The rs-dash-async crate exports the new modules and feature wiring.
Shared coordinator lifecycle and manager refactor
packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs, packages/rs-platform-wallet/src/manager/identity_sync.rs, packages/rs-platform-wallet/src/manager/platform_address_sync.rs, packages/rs-platform-wallet/src/manager/shielded_sync.rs, packages/rs-platform-wallet/src/changeset/core_bridge.rs, packages/rs-platform-wallet/src/changeset/mod.rs, packages/rs-platform-wallet/Cargo.toml
CoordinatorLifecycle is introduced and then used by IdentitySyncManager, PlatformAddressSyncManager, and ShieldedSyncManager for interval tracking, pass gating, quiescing, and thread startup. The managers switch to ThreadRegistry, return CoordinatorThreadStatus from quiesce(), and update tests and construction paths accordingly.
Wallet shutdown status modeling and manager teardown
packages/rs-platform-wallet/src/manager/mod.rs, packages/rs-platform-wallet/src/error.rs, packages/rs-platform-wallet/src/lib.rs
PlatformWalletManager adds shutdown status modeling, registers the wallet event adapter with ThreadRegistry, returns CoordinatorExitStatus from shutdown(), and refuses clear_shielded() when shielded quiesce is incomplete. New tests cover status mapping, idempotency, orphan handling, and teardown ordering.
Wallet error and FFI result propagation
packages/rs-platform-wallet/src/error.rs, packages/rs-platform-wallet-ffi/src/error.rs, packages/rs-platform-wallet-ffi/Cargo.toml, packages/rs-platform-wallet/src/lib.rs
PlatformWalletError gains ShieldedShutdownIncomplete, PlatformWalletFFIResultCode gains ErrorShutdownIncomplete, and the FFI conversion maps the new wallet error to the new FFI result code. The crate root re-exports SHUTDOWN_JOIN_TIMEOUT_SECS.
FFI destroy and shielded sync shutdown handling
packages/rs-platform-wallet-ffi/src/manager.rs, packages/rs-platform-wallet-ffi/src/shielded_sync.rs, packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift, packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift
platform_wallet_manager_destroy now inspects shutdown status and returns ErrorShutdownIncomplete when the managers do not shut down cleanly. platform_wallet_manager_shielded_sync_stop uses a timeout around quiesce and maps non-clean results to the same FFI code, while platform_wallet_manager_shielded_clear maps ShieldedShutdownIncomplete separately. Swift result and error types add matching cases and mapping logic, and the Swift deinit path retains callback handlers on incomplete shutdown.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Suggested labels

ready for final review

Suggested reviewers

  • shumkov
  • lklimek
  • QuantumExplorer
  • thepastaclaw

Poem

🐇 I hopped through threads and counted each stop,
A guard kept flags from staying on top.
If shutdown is wobbly, the status will speak,
And callback tails linger just long as they պետք.
Clean joins at dusk, with a nibble and grin —
the moon sees a rabbit who knows when to spin.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the shared ThreadRegistry refactor and the shutdown safety fixes in platform-wallet.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/platform-wallet-shutdown-join

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@lklimek lklimek marked this pull request as ready for review June 23, 2026 08:06
@lklimek lklimek requested a review from thepastaclaw June 23, 2026 08:06
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.54545% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.20%. Comparing base (48f0cc3) to head (a89e0f9).
⚠️ Report is 2 commits behind head on v3.1-dev.

Files with missing lines Patch % Lines
packages/rs-dash-async/src/atomic.rs 94.54% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           v3.1-dev    #3954      +/-   ##
============================================
+ Coverage     87.17%   87.20%   +0.02%     
============================================
  Files          2629     2634       +5     
  Lines        327221   328719    +1498     
============================================
+ Hits         285265   286666    +1401     
- Misses        41956    42053      +97     
Components Coverage Δ
dpp 87.70% <ø> (ø)
drive 86.14% <ø> (ø)
drive-abci 89.45% <ø> (ø)
sdk ∅ <ø> (∅)
dapi-client ∅ <ø> (∅)
platform-version ∅ <ø> (∅)
platform-value 92.20% <ø> (ø)
platform-wallet ∅ <ø> (∅)
drive-proof-verifier 49.55% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
packages/rs-platform-wallet/src/manager/identity_sync.rs (1)

418-459: 🩺 Stability & Availability | 🟠 Major

Don't overwrite an unjoined coordinator generation.

stop() removes only background_cancel and leaves the previous background_join unjoined. A stop()start() sequence passes the cancel_guard.is_some() check and overwrites background_join with a new handle, losing the old OS-thread handle before shutdown can join it. Gate restart on confirming the prior handle has been joined, or ensure every generation's handle is tracked and joined.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 418 -
459, The start() method can overwrite an existing background_join handle before
it has been properly joined, causing resource leaks. Before spawning the new
identity-sync thread and storing its handle in background_join, add a check to
ensure any existing join handle in background_join has been properly cleaned up
or joined first. This can be done either by joining the existing handle before
proceeding, or by verifying that background_join is None before allowing the new
thread spawn to proceed. This ensures the prior thread's OS handle is not lost
and can be properly shutdown.
packages/rs-platform-wallet/src/manager/platform_address_sync.rs (1)

218-255: 🩺 Stability & Availability | 🟠 Major

Add generation guard to prevent thread handle loss after stop/start cycles.

After stop() cancels the token, background_cancel becomes None while the old thread keeps running. A subsequent start() sees cancel_guard.is_some() == false and spawns a new thread, unconditionally overwriting background_join. The old thread's join handle is lost, making it impossible to join its cleanup later. Additionally, without a generation counter, the exiting old thread clears the new generation's background_cancel token as it shuts down, creating a race where the new loop runs but appears stopped to is_running().

Both IdentitySyncManager and ShieldedSyncManager already implement this pattern: they increment background_generation on each start(), pass my_gen to the spawned thread, and check background_generation.load() == my_gen before clearing background_cancel on exit. Apply the same approach here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around
lines 218 - 255, The thread handle for the platform address sync loop can be
lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.
packages/rs-platform-wallet/src/manager/shielded_sync.rs (1)

245-300: 🩺 Stability & Availability | 🔴 Critical

Don't replace a pending shielded-sync join handle on rapid stop/start cycles.

The start() method checks only background_cancel to guard against concurrent starts. When stop() removes the token from background_cancel, a subsequent start() proceeds to spawn a new thread and overwrites background_join, even though the previous generation's thread is still winding down. The generation check (line 290) only prevents the old thread from clearing background_cancel—it does not protect background_join. This leaves prior-generation handles permanently lost.

To fix: either require quiesce() before start() to join all pending generations, or track join handles per-generation and join all pending before spawning a new thread.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 245 -
300, The `start()` method only checks `background_cancel` to guard against
concurrent starts, but when `stop()` clears this token, a subsequent `start()`
can spawn a new thread and overwrite the `background_join` handle before the
previous generation's thread has finished cleaning up. The generation check at
the cleanup section prevents the old thread from clearing `background_cancel`,
but does not protect `background_join` from being overwritten. Fix this by
either adding logic to join all pending prior-generation join handles before
storing the new one (by tracking handles per-generation), or by ensuring that
before assigning a new join handle to `background_join` in the `start()` method,
any existing pending handle from a prior generation is properly joined and
waited for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 706-710: The while loop waiting for handler_started to become true
has no timeout, which will cause the test to hang indefinitely if the slow
callback never executes. Wrap the entire while loop (that checks
handler_started.load(AO::Acquire)) with tokio::time::timeout() and provide an
appropriate timeout duration, then handle the timeout error case with an
assertion that explicitly fails the test with a useful message. This ensures CI
fails fast with clear feedback instead of timing out.
- Around line 483-489: In the wallet event adapter task join error handling, the
else branch that handles non-panic JoinErrors currently returns
CoordinatorThreadStatus::Ok, which should instead return
CoordinatorThreadStatus::Error with an appropriate error message. Change the
else clause that follows the is_panic() check to map the error to a
CoordinatorThreadStatus::Error variant containing details about the join error,
rather than treating it as a successful completion.
- Around line 175-181: The current implementation uses
tokio::task::spawn_blocking to wrap handle.join(), but this pattern prevents the
timeout from effectively interrupting the blocking task if the coordinator
thread hangs. Replace the spawn_blocking closure approach with explicit polling:
repeatedly check if the JoinHandle is finished using is_finished() in a loop
until the deadline is reached, and only call join() once the handle confirms it
is finished. This ensures the timeout boundary is enforced even if the
coordinator thread misbehaves or fails to clear is_syncing before join() is
called.

---

Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 418-459: The start() method can overwrite an existing
background_join handle before it has been properly joined, causing resource
leaks. Before spawning the new identity-sync thread and storing its handle in
background_join, add a check to ensure any existing join handle in
background_join has been properly cleaned up or joined first. This can be done
either by joining the existing handle before proceeding, or by verifying that
background_join is None before allowing the new thread spawn to proceed. This
ensures the prior thread's OS handle is not lost and can be properly shutdown.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 218-255: The thread handle for the platform address sync loop can
be lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 245-300: The `start()` method only checks `background_cancel` to
guard against concurrent starts, but when `stop()` clears this token, a
subsequent `start()` can spawn a new thread and overwrite the `background_join`
handle before the previous generation's thread has finished cleaning up. The
generation check at the cleanup section prevents the old thread from clearing
`background_cancel`, but does not protect `background_join` from being
overwritten. Fix this by either adding logic to join all pending
prior-generation join handles before storing the new one (by tracking handles
per-generation), or by ensuring that before assigning a new join handle to
`background_join` in the `start()` method, any existing pending handle from a
prior generation is properly joined and waited for completion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01cab324-105d-4c5d-afe0-0ceb6faff13e

📥 Commits

Reviewing files that changed from the base of the PR and between 9a93387 and 261178e.

📒 Files selected for processing (5)
  • packages/rs-platform-wallet-ffi/src/manager.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/rs-platform-wallet/src/manager/mod.rs (2)

405-408: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Handle unclean shielded quiesce before clearing state.

quiesce() now returns a meaningful shutdown status. Ignoring it lets clear_shielded() proceed to coord.clear() after Timeout, Stopped, or Panicked, which can race a still-running shielded pass that the quiesce barrier was meant to stop.

Proposed fix
-        self.shielded_sync_manager.quiesce().await;
+        let status = self.shielded_sync_manager.quiesce().await;
+        if !status.is_clean() {
+            return Err(crate::error::PlatformWalletError::ShieldedStoreError(
+                format!("shielded sync did not stop cleanly before clear: {status:?}"),
+            ));
+        }
         if let Some(coord) = self.shielded_coordinator().await {
             coord.clear().await?;
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 405 - 408, The
clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.

463-477: 🩺 Stability & Availability | 🟠 Major

Wrap quiescing in AtomicFlagGuard to ensure cancellation-safe reset in all coordinator quiesce() implementations.

The current implementations set quiescing = true before the awaited drain loop and reset it only after. If tokio::time::timeout drops the future during the loop, the reset never executes, permanently wedging quiescing and blocking all future syncs.

All three coordinators (platform_address_sync.rs:291, identity_sync.rs:494, shielded_sync.rs:334) have identical patterns. Use the same AtomicFlagGuard approach already correctly applied to is_syncing in sync_now():

pub async fn quiesce(&self) -> super::CoordinatorThreadStatus {
    self.quiescing.store(true, Ordering::Release);
    let _quiescing_guard = AtomicFlagGuard::new(&self.quiescing);
    self.stop();
    while self.is_syncing.load(Ordering::Acquire) {
        tokio::time::sleep(Duration::from_millis(20)).await;
    }
    // quiescing.store(false) removed — guard handles reset on all exit paths
    // ...rest of implementation
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 463 - 477, The
`quiesce()` method implementations in all three coordinators (the ones
containing the timeout calls for `platform_address_sync_manager`,
`identity_sync_manager`, and `shielded_sync_manager`) don't properly handle
cancellation when `tokio::time::timeout` drops the future. Wrap the `quiescing`
flag reset in an `AtomicFlagGuard` to ensure it's reset on all exit paths
including early cancellation. In each coordinator's `quiesce()` method, after
setting `quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.
🧹 Nitpick comments (1)
packages/rs-dash-async/src/atomic.rs (1)

8-15: 📐 Maintainability & Code Quality | 🔵 Trivial

Add #[must_use] annotation to AtomicFlagGuard.

The guard is dropped as a temporary if not bound, silently resetting the flag immediately. This breaks the intended guarded scope behavior. Mark the type with #[must_use] to catch accidental non-binding at compile time.

Proposed fix
-pub struct AtomicFlagGuard<'a>(&'a AtomicBool);
+#[must_use = "AtomicFlagGuard clears the flag on drop; bind it to keep the flag set for the guarded scope"]
+pub struct AtomicFlagGuard<'a>(&'a AtomicBool);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-dash-async/src/atomic.rs` around lines 8 - 15, The
AtomicFlagGuard struct can be accidentally dropped without being bound to a
variable, causing the flag to be reset immediately and breaking the guarded
scope behavior. Add the #[must_use] attribute to the AtomicFlagGuard struct
definition to make the compiler warn when the guard is not explicitly bound to a
variable, ensuring the developer catches this mistake at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 671-695: The test
`event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean` is
non-deterministic because it accepts both `CoordinatorThreadStatus::Stopped` and
`CoordinatorThreadStatus::Ok` as valid outcomes. This allows the test to pass
without actually exercising the non-panic JoinError branch. To fix this, replace
the current task abort approach with a task that is guaranteed to be pending and
never complete on its own, such as a task that awaits on a channel or a
never-resolving future. This ensures the abort always triggers the Stopped path
deterministically, and update the assertion to only expect
`CoordinatorThreadStatus::Stopped`.

---

Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 405-408: The clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.
- Around line 463-477: The `quiesce()` method implementations in all three
coordinators (the ones containing the timeout calls for
`platform_address_sync_manager`, `identity_sync_manager`, and
`shielded_sync_manager`) don't properly handle cancellation when
`tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an
`AtomicFlagGuard` to ensure it's reset on all exit paths including early
cancellation. In each coordinator's `quiesce()` method, after setting
`quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.

---

Nitpick comments:
In `@packages/rs-dash-async/src/atomic.rs`:
- Around line 8-15: The AtomicFlagGuard struct can be accidentally dropped
without being bound to a variable, causing the flag to be reset immediately and
breaking the guarded scope behavior. Add the #[must_use] attribute to the
AtomicFlagGuard struct definition to make the compiler warn when the guard is
not explicitly bound to a variable, ensuring the developer catches this mistake
at compile time.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 192ea517-ff8a-4b43-9773-c096391c8a49

📥 Commits

Reviewing files that changed from the base of the PR and between 261178e and 6e78b77.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • packages/rs-dash-async/src/atomic.rs
  • packages/rs-dash-async/src/lib.rs
  • packages/rs-platform-wallet/Cargo.toml
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
@thepastaclaw

thepastaclaw commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

✅ Review complete (commit a89e0f9)

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated
Comment thread packages/rs-dash-async/src/atomic.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-dash-async/src/atomic.rs
Comment thread packages/rs-dash-async/src/atomic.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

🟠 MEDIUM — Sibling Stop/Clear paths call quiesce() with no timeout (outside-diff-range)

Posted at PR level because the affected call sites are not part of this PR's diff.

shutdown() (mod.rs) carefully bounds every quiesce() with the 30s SHUTDOWN_JOIN_TIMEOUT_SECS. The sibling entry points that call the same quiesce() do not:

  • PlatformWalletManager::clear_shieldedpackages/rs-platform-wallet/src/manager/mod.rs:406 (self.shielded_sync_manager.quiesce().await;, unbounded)
  • FFI platform_wallet_manager_shielded_sync_stoppackages/rs-platform-wallet-ffi/src/shielded_sync.rs:91 (runtime().block_on(manager.shielded_sync().quiesce()))
  • FFI platform_wallet_manager_shielded_clear…/shielded_sync.rs:442 (→ clear_shielded)

quiesce() contains the unbounded while is_syncing { sleep(20ms) } drain plus the newly-added join_coordinator_thread. If an in-flight shielded pass stalls (slow/partitioned network), the drain spins forever and the calling (Swift/UI) thread blocks permanently in block_on — a user tapping Clear shielded or Stop sync wedges the app with no recovery. The new join only lengthens the wait. The PR introduced the bounded-wait discipline for shutdown() but didn't extend it to these user-initiated paths that share the identical drain.

Fix: apply the same timeout bound (or a Stop/Clear-specific one) around these quiesce() calls and surface a timeout result; better still, make the underlying pass cancellable so cancellation cuts the stalled await.

🤖 Co-authored by Claudius the Magnificent AI Agent

Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Solid lifecycle hardening overall — joining coordinator threads and the RAII is_syncing guard close real races. Three in-scope blockers undermine the new shutdown contract: the FFI destroy still returns success on non-clean shutdown (Swift may free a context a coordinator thread still owns), the bounded timeout doesn't actually bound anything because spawn_blocking tasks are non-abortable, and start-after-stop overwrites the saved JoinHandle so a later shutdown cannot join the stranded thread. Two suggestions and two test-quality nits round out the review.

🔴 3 blocking | 🟡 2 suggestion(s) | 💬 2 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet-ffi/src/manager.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/manager.rs:366-376: Destroy returns ok() on non-clean shutdown — re-opens callback-after-free window
  `shutdown()` can now legitimately return `CoordinatorThreadStatus::Timeout` (and `Stopped`/`Panicked`/`Error`) meaning a coordinator's OS thread or the event-adapter task did not actually join. This FFI entry point logs that outcome and still returns `PlatformWalletFFIResult::ok()`. The Swift host (`PlatformWalletManager.swift` deinit) is documented to free the callback `context` once `platform_wallet_manager_destroy` returns; meanwhile the still-alive coordinator thread holds an `Arc<FFIEventHandler>` / `Arc<FFIPersister>` and can fire `on_*_sync_completed` or `persister.store(...)` through the now-dangling `context` pointer. That is precisely the use-after-free this PR set out to close — the previous unbounded wait would have hung instead of returning false success. Surface non-clean shutdowns as a distinct, non-ok result code so the host knows not to free its context (or keep the FFI-owned handler `Arc` alive on the non-clean path, e.g. `mem::forget`, so any lingering callback remains memory-safe).

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:178-192: Timeout doesn't actually bound shutdown — spawn_blocking is non-abortable
  `shutdown()` wraps each `quiesce()` in `tokio::time::timeout`, but `join_coordinator_thread` moves the `std::thread::JoinHandle` into `spawn_blocking(move || handle.join())`. Tokio blocking tasks cannot be aborted once started: dropping the outer timeout future stops `await`ing the `JoinHandle` but the underlying blocking task is still parked inside `handle.join()`, keeping the coordinator thread's `Arc<…SyncManager>` (and the host callback contexts it transitively holds) alive. When the caller then drops the multi-thread runtime, `Runtime::drop` returns without waiting for blocking tasks, leaving an OS thread plus a stranded blocking thread alive on a freed runtime — which is the very `"Tokio 1.x context … being shutdown"` race the PR cites in its motivation. Make the join cancellation-aware by polling `handle.is_finished()` (with a short async sleep) before the final `handle.join()`, so dropping the timeout future actually releases all state.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/mod.rs:441-459: assert! on runtime flavor is incorrect — spawn_blocking works on current_thread
  The promotion from `debug_assert!` to `assert!` is justified in the docstring by the claim that `spawn_blocking` 'is not available on `current_thread` runtimes and will panic there.' That's not how Tokio works: `spawn_blocking` dispatches to the runtime's shared blocking pool, which both `multi_thread` and `current_thread` runtimes provision; awaiting the returned `JoinHandle` simply yields the runtime task. Today's only in-tree caller (`platform-wallet-ffi/src/runtime.rs:34`) builds a multi-thread runtime, but `shutdown()` is now a public Rust API. Any downstream consumer using `#[tokio::main(flavor = "current_thread")]` (or `Builder::new_current_thread()`) will hit a release-mode panic on a configuration that would have worked. Revert to `debug_assert!` (or drop it entirely) and correct the docstring rationale.

In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-448: start-after-stop overwrites background_join, dropping the previous thread handle
  `stop()` takes `background_cancel` (leaving it `None`) but never touches `background_join`. A subsequent `start()` passes the `cancel_guard.is_some()` early-return check, spawns a fresh thread, and at line 447 unconditionally overwrites `background_join` with the new handle — the previous handle is dropped (detached). The old coordinator thread is cancellation-bound but not yet exited; it can still be inside its `block_on` polling `tokio::time` and feeding the event manager when the new handle replaces it. A later `shutdown()` then can only join the new thread, so the original thread can outlive `shutdown()` and touch the host's freed callback context — recreating exactly the runtime-drop race this PR is meant to eliminate. The same pattern applies to `platform_address_sync` and `shielded_sync`. Fix by taking-and-joining (or refusing) any non-`None` handle inside the `start()` lock before installing the new one.

In `packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/platform_address_sync.rs:291-304: quiescing gate is reopened before the join completes — sync_now can slip a pass through
  `quiesce()` clears `self.quiescing` to `false` *before* taking `background_join` and awaiting `join_coordinator_thread`. The join can block for up to `SHUTDOWN_JOIN_TIMEOUT_SECS` (30s). During that window the loop is cancelled and won't start new passes, but an external caller invoking `sync_now` (e.g. an FFI on-demand sync) finds a fully open gate — `quiescing=false`, `is_syncing=false` — and runs a complete pass, including the `on_platform_address_sync_completed` host callback. That breaks the documented `quiesce()` contract that no further callback can fire after it returns, undermining the manager-level shutdown guarantee. Same pattern in `identity_sync::quiesce` and `shielded_sync::quiesce`. Move `quiescing.store(false, …)` to *after* the join completes (or have `sync_now`/`sync_wallet` also consult `background_join.is_some()`).

Comment thread packages/rs-platform-wallet-ffi/src/manager.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs
Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Update — SEC-002 fixed in 93b89546ed: clear_shielded and the FFI platform_wallet_manager_shielded_sync_stop bridge now wrap quiesce() in the same SHUTDOWN_JOIN_TIMEOUT_SECS bound shutdown() uses, so a stalled shielded pass can no longer hang the host thread indefinitely.

🤖 Co-authored by Claudius the Magnificent AI Agent

@lklimek

lklimek commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@thepastaclaw please review

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Incremental review against head 3cca1cf. The new tokio::join! concurrent drain in shutdown() is structurally sound — each coordinator's quiesce() touches only its own quiescing/is_syncing atomics and background_cancel/background_join mutexes, and the shared coordinator_orphans Mutex is not taken inside quiesce(), so racing the three drains is safe; per-coordinator inner timeouts correctly preserve per-thread CoordinatorThreadStatus that a single outer timeout would have flattened. However, both prior blocking findings are unchanged on this head: a timed-out quiesce still loses the only JoinHandle and silently detaches a live coordinator OS thread, and clear_shielded still wipes the shielded store without consulting coordinator_orphans. Both were unanimously re-flagged by all six agent reviews and both are required for this PR's stated lifetime/UAF guarantees to actually hold.

🔴 1 blocking

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the shielded store without draining parked coordinator orphans
  clear_shielded (mod.rs:644–674) gates coord.clear().await on shielded_sync_manager.quiesce() returning is_clean(), and the doc on SHUTDOWN_JOIN_TIMEOUT_SECS (mod.rs:407–408) explicitly lists clear_shielded and the FFI shielded-stop bridge as using the same backstop. But neither path inspects self.coordinator_orphans. Shielded coordinator start() parks a prior wedged shielded OS thread into that shared list via reap_prior_or_park (mod.rs:296–332) whenever a tight stop()→start() cannot reap within the 1 s wedge backstop. Only shutdown() actually drains the list via join_detached_orphans (mod.rs:820–824).

  A parked shielded thread still holds an Arc to the same NetworkShieldedCoordinator and persister context that coord.clear() (mod.rs:670–672) is about to wipe; the doc on lines 663–666 warns exactly about this — "a surviving pass writing into a store we just cleared, desyncing the host's own wipe from a repopulated tree." Because the orphan drain is missing, a sequence of stop()→start() that wedges the prior shielded thread, followed by clear_shielded, lets the wipe proceed while the parked thread is still alive and can persister.store(...) / fire on_shielded_* host callbacks against the cleared store.

  The same defect is in the FFI platform_wallet_manager_shielded_sync_stop bridge: it awaits only quiesce(), so a non-clean orphan goes unreported and the host (which sees ok()) is free to release the callback context that a parked shielded thread still references.

  Fix shape (symmetric with shutdown()): in both clear_shielded and platform_wallet_manager_shielded_sync_stop, after quiesce() returns clean, run a bounded join_detached_orphans(&self.coordinator_orphans, deadline) and fold the result into the precondition — return ShieldedShutdownIncomplete { status } (and leave the store intact) unless both the live drain and the orphan drain are clean. Caveat: coordinator_orphans is shared across all three coordinator kinds; either accept the conservative wait for identity/platform-address orphans before a shielded Clear, or tag parked handles with their coordinator kind so Clear can drain only the shielded subset.

@Claudius-Maginificent Claudius-Maginificent changed the title feat(platform-wallet)!: shutdown() joins coordinator threads and returns CoordinatorExitStatus feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes Jun 25, 2026

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Both prior blocking findings are structurally fixed in the new ThreadRegistry engine (prior-1 by the slot-owned JoinHandle + generation-scoped Repark drop-guard; prior-2 by clear_shielded's any_alive_for(ShieldedSync) gate under a held quiescing guard). One new blocking regression was introduced by the CoordinatorLifecycle dedup: the shared quiesce() no longer raises the quiescing gate itself or drains is_syncing — it only delegates to ThreadRegistry::quiesce, whose drain hook is skipped when no slot is registered — so a direct sync_now/sync_wallet in flight (which the FFI exposes without requiring the background loop to be running) is no longer awaited before quiesce reports clean. Two narrower follow-ups (symmetric orphan check on shielded_sync_stop FFI; out-of-lock prior-handle reap racing shutdown's orphan reap) round out the review.

🔴 1 blocking | 🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:146-149: Shared quiesce() no longer drains direct in-flight sync passes
  The pre-refactor per-coordinator `quiesce()` explicitly raised `quiescing` and awaited `is_syncing` to fall before joining (`self.quiescing.store(true, ...); self.stop(); while self.is_syncing { sleep(20ms) }`). The new shared `CoordinatorLifecycle::quiesce` is `let _g = AtomicFlagGuard::new(&self.quiescing); self.registry.quiesce(self.worker).await.into()`. `AtomicFlagGuard::new` (rs-dash-async/src/atomic.rs:18-22) deliberately does **not** raise the flag — only the registry's drain hook does, and `ThreadRegistry::quiesce` (registry.rs:507-519) early-returns `NotRunning` without invoking the drain hook whenever no live slot is registered.

  The FFI exposes `platform_wallet_manager_shielded_sync_sync_now` / `shielded_sync_wallet` (shielded_sync.rs:208, 590) — and the same shape for identity / platform-address — which call `shielded_sync().sync_now(true)` / `sync_wallet(..., true)` directly, with no requirement that the background loop is running. So a host-driven direct pass can be in flight while `clear_shielded` calls `quiesce()`: the registry has no slot, the drain hook never fires, the gate is never raised, and `quiesce()` returns `NotRunning` — clean — immediately. `clear_shielded` (mod.rs:531-562) then takes the `is_clean()` branch, raises the gate via `hold_quiescing_gate()` (too late — the in-flight pass already passed `begin_pass()` and does not recheck mid-pass), and `any_alive_for(ShieldedSync)` returns false (it only inspects slots/orphans, not `is_syncing`). `coord.clear().await` wipes the store while the direct pass keeps writing through `coord.sync(...)` and its persister fan-out.

  This directly contradicts the documented contract on `clear_shielded` (mod.rs:517-518: "A concurrent direct `sync_now`/`sync_wallet` is held off (the quiescing gate stays raised across the liveness check and the wipe)"), and is the same store-desync class the prior F2 finding closed for parked orphans. Fix by restoring the explicit gate raise + `is_syncing` drain in the shared `quiesce`, before delegating to the registry — symmetric with what each coordinator did before the dedup.

In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:117-134: shielded_sync_stop FFI does not consult any_alive_for(ShieldedSync)
  `platform_wallet_manager_shielded_sync_stop` returns `Success` whenever the per-key `quiesce()` lands clean (Ok/NotRunning). It does not consult `registry.any_alive_for(WalletWorker::ShieldedSync)` — the symmetric gate `clear_shielded` adopted at manager/mod.rs:553 (the F2 fix) and the destroy path covers via `CoordinatorExitStatus::all_clean()` which folds in `detached_threads`.

  A prior tight start->stop reap that had to park a wedged shielded thread as an orphan would not block the next stop's `Success`, even though the parked thread still holds `Arc`s to the host's `FFIEventHandler::context` / `FFIPersister::context` pointers and can fire one final callback through them. The function's docstring (shielded_sync.rs:80-83) says "On this code the host must NOT free the callback context immediately" only on the non-clean branch — implying that on `Success` the host MAY free it.

  Today's only consumer (Swift `deinit`) always follows stop with destroy, and destroy's `manager.shutdown()` drains orphans, so no UAF currently reaches a host. The contract gap is still worth closing symmetrically with the other two quiesce-then-touch-host-context teardown paths.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:406-414: start_thread's out-of-lock prior reap races shutdown's orphan reap
  `start_thread` takes the prior handle under the slot lock, releases the lock at line 406, then calls `self.reap_prior_or_park(prior, key)` synchronously at line 413. `reap_prior_or_park` for an OS thread (registry.rs:737-762) spins up to `reap_backstop` (1 s in the wallet) before pushing a still-wedged handle into orphans. The `closing` latch and shutdown's tier snapshot only synchronize on the slot lock — they do not block on this out-of-lock reap. Interleaving from different threads:

  1. A: `start_thread` takes the slot lock, passes the `closing` check, takes the wedged `prior`, installs the new gen handle, releases the lock. A is now spinning in `reap_prior_or_park`.
  2. B: `shutdown()` takes the slot lock, sets `closing=true`, snapshots tiers (sees only the NEW gen handle — `prior` was already moved out), releases.
  3. B drains the new gen via `quiesce` (clean).
  4. B calls `reap_orphans_impl(reap_backstop)` — orphan list is still empty because A hasn't pushed yet.
  5. B returns a clean `ShutdownReport` → `CoordinatorExitStatus::all_clean()` → FFI `destroy` returns `ok()` → host frees its callback context.
  6. A's 1 s backstop expires; A pushes the still-live wedged prior into orphans and returns.

  The prior thread is now a live orphan holding `Arc`s to the freed host context — the same UAF surface F1 was designed to close. Requires (a) a wedged-past-backstop prior, (b) `start_thread` invoked concurrently with `shutdown` from a different thread, and (c) shutdown's reap_orphans phase to land in A's 1 s backstop window — narrow, but the asymmetric to the closing-latch protection the PR explicitly adds. Close by parking `prior` immediately under the slot lock and reaping asynchronously, or by tracking in-flight reaps so `reap_orphans_impl` waits for them to settle.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs Outdated
Comment thread packages/rs-platform-wallet-ffi/src/shielded_sync.rs
Comment thread packages/rs-dash-async/src/registry.rs
@lklimek

lklimek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@coderabbitai review all

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@lklimek Re-reviewing the full PR against the latest tip now.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
packages/rs-platform-wallet-ffi/src/manager.rs (1)

363-365: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Update the inline comment to match the new FFI behavior.

Lines 363-365 still say destroy will "just log it and drop it", but Lines 367-382 now surface non-clean shutdowns as ErrorShutdownIncomplete. Leaving the old comment here is misleading in the callback-lifetime path this PR is hardening.

✏️ Suggested edit
-        // It now joins the coordinator OS threads and returns their
-        // per-thread exit status; the C ABI exposes none of that, so we
-        // just log it (a panicked loop is worth surfacing) and drop it.
+        // It now joins the coordinator OS threads and returns their
+        // per-thread exit status. The C ABI does not expose that full
+        // structure directly, but a non-clean aggregate is surfaced as
+        // `ErrorShutdownIncomplete` so the host can keep its callback
+        // context alive until lingering workers settle.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet-ffi/src/manager.rs` around lines 363 - 365, The
inline comment near the destroy/callback-lifetime handling in manager::destroy
is outdated and still says the shutdown result is only logged and dropped.
Update that comment to describe the current FFI behavior in destroy and the
related callback-lifetime path: the coordinator OS threads are joined, and
non-clean shutdowns are surfaced as ErrorShutdownIncomplete instead of being
silently discarded. Keep the wording aligned with the destroy logic and
ErrorShutdownIncomplete handling so the comment matches the implemented shutdown
semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-dash-async/src/registry.rs`:
- Around line 674-679: The shutdown path in `Registry::shutdown` is discarding
the orphan result status from `reap_orphans_impl()`, which can hide non-clean
exits. Update `ShutdownReport` and its construction so the orphan status is
preserved alongside `detached`, and adjust `ShutdownReport::all_clean()` to
consider that stored status instead of treating parked orphans as clean by
default.

In `@packages/rs-platform-wallet/src/changeset/core_bridge.rs`:
- Around line 68-110: When the cancel branch in core_bridge.rs wins the
tokio::select! race, the loop exits immediately and can drop buffered broadcast
events, so update the cancellation handling in the main event loop to drain
receiver before breaking. Use the existing build_core_changeset and
persister.store flow for each pending event from receiver.try_recv(), preserve
the core.is_empty_no_records() skip, and keep the existing error/warn logging
style keyed by wallet_id so shutdown still persists any queued CoreChangeSet
updates.

In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 407-410: Update the `stop()` documentation in `identity_sync.rs`
to reflect that background sync passes are cancellable and may exit early when
shutdown is requested. Adjust the wording around `sync_now()`/`tokio::select!`
in `stop()` and any related comments so it no longer claims a pass will always
run to completion after `stop()`, and make the doc mention that cancellation can
interrupt an in-flight `.await`.

In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 515-522: The `clear_shielded`/`shielded_sync_start` path in
`manager::mod` still relies on a host-side precondition, but `start()` can race
`clear()` and re-persist into a wiped store. Add a per-key clearing latch in the
wallet manager/registry logic, have `shielded_sync_start` honor it, and keep it
held across quiesce, liveness check, and `coord.clear()` so the clear/start
exclusion is enforced inside the wallet rather than by the caller.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 206-210: Update the `stop()` documentation in
`PlatformAddressSync` to reflect that cancellation can win over `sync_now()` in
the `tokio::select!` loop, so a background pass may exit before reaching
`on_platform_address_sync_completed`. Adjust the docs/comments near `stop()`,
`sync_now()`, and the select loop to clearly state that shutdown does not
guarantee completion of an in-flight sync pass.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 226-230: Update the stop() documentation to reflect that
background-pass cancellation can interrupt an in-flight sync_now(false) call
rather than guaranteeing the persister fan-out completes. In shielded_sync.rs,
adjust the docs/comments around the tokio::select! path in
ShieldedSyncManager::stop and the background coordinator.sync() flow to state
that cancellation may win the race and the sync can be dropped at an await. Keep
the description aligned with the actual behavior of sync_now(false) and the
cancel.cancelled() branch.

---

Nitpick comments:
In `@packages/rs-platform-wallet-ffi/src/manager.rs`:
- Around line 363-365: The inline comment near the destroy/callback-lifetime
handling in manager::destroy is outdated and still says the shutdown result is
only logged and dropped. Update that comment to describe the current FFI
behavior in destroy and the related callback-lifetime path: the coordinator OS
threads are joined, and non-clean shutdowns are surfaced as
ErrorShutdownIncomplete instead of being silently discarded. Keep the wording
aligned with the destroy logic and ErrorShutdownIncomplete handling so the
comment matches the implemented shutdown semantics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef1d0648-a464-44ab-8151-583159ad51f5

📥 Commits

Reviewing files that changed from the base of the PR and between 87f6b73 and 911f99f.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • packages/rs-dash-async/Cargo.toml
  • packages/rs-dash-async/src/atomic.rs
  • packages/rs-dash-async/src/lib.rs
  • packages/rs-dash-async/src/registry.rs
  • packages/rs-platform-wallet-ffi/Cargo.toml
  • packages/rs-platform-wallet-ffi/src/error.rs
  • packages/rs-platform-wallet-ffi/src/manager.rs
  • packages/rs-platform-wallet-ffi/src/shielded_sync.rs
  • packages/rs-platform-wallet/Cargo.toml
  • packages/rs-platform-wallet/src/changeset/core_bridge.rs
  • packages/rs-platform-wallet/src/changeset/mod.rs
  • packages/rs-platform-wallet/src/error.rs
  • packages/rs-platform-wallet/src/lib.rs
  • packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
  • packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift
  • packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift

Comment thread packages/rs-dash-async/src/registry.rs Outdated
Comment thread packages/rs-platform-wallet/src/changeset/core_bridge.rs
Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/shielded_sync.rs Outdated

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

All three prior findings (911f99f) are resolved at 748c4f8: CoordinatorLifecycle::quiesce now self-raises the quiescing gate and drains in-flight direct passes (coordinator_lifecycle.rs:166-173, with a regression test); the shielded_sync_stop FFI now consults manager.shielded_worker_alive() and returns ErrorShutdownIncomplete when a prior-generation shielded orphan is parked (shielded_sync.rs:127-157); and start_thread now parks the prior worker under the slot lock via park_prior_locked before the out-of-lock reap, with a regression test in the registry. One in-scope cumulative concern remains: the shared quiescing AtomicBool means a concurrent quiesce()'s AtomicFlagGuard drop can lower the gate that clear_shielded's hold_quiescing_gate is supposed to keep raised continuously — the mechanism does not match the doc's continuous-hold promise in clear_shielded_inner. In practice this is gated by the host's UI-thread serialization, so it is a suggestion, not a blocker.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-173: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()
  `clear_shielded_inner` holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across drain → liveness check → wipe, expecting the shielded `quiescing` boolean to stay `true` for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct `sync_now`/`sync_wallet` are held off). But the same `AtomicBool` backs `CoordinatorLifecycle::quiesce`, which builds its own `AtomicFlagGuard` (coordinator_lifecycle.rs:171); `AtomicFlagGuard::drop` unconditionally stores `false` (rs-dash-async/src/atomic.rs:25-29). If `platform_wallet_manager_shielded_sync_stop` is invoked concurrently with `clear_shielded` (the FFI handle store only takes a brief read lock in `with_item`, so two calls on the same handle can interleave), the stop's `quiesce()` returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct `sync_now`/`sync_wallet` then sees `quiescing=false` in `begin_pass`, claims `is_syncing`, and can re-persist into the store `coord.clear()` is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on `clear_shielded` only excludes concurrent `shielded_sync_start`, not `_stop`. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that `begin_pass` also consults, or tighten the host-serialization doc to enumerate `_stop` too.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs Outdated
Comment thread packages/rs-dash-async/src/registry.rs Outdated
Comment thread packages/rs-dash-async/src/registry.rs Outdated
Comment thread packages/rs-dash-async/src/registry.rs
Comment thread packages/rs-dash-async/src/registry.rs
Comment thread packages/rs-dash-async/src/registry.rs

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Incremental review of 748c4f8..6ed5200. Latest delta adds the server-advertised rate-limit ban path in rs-dapi-client + dashmate gateway wiring, simplifies platform_wallet_manager_shielded_sync_stop to cancel-only, drains buffered wallet events on cancel in event-adapter (T7), and threads orphan_status through ShutdownReport. No new blocking defects found. The prior architectural finding around clear_shielded's continuously-held gate vs. CoordinatorLifecycle::quiesce on the same AtomicBool is STILL VALID but now latent: ShieldedSyncManager::quiesce remains pub on the same atomic, however no FFI export or in-tree Rust caller races it against clear_shielded (verified). Carried forward as suggestion only. Model coverage degraded: Codex (general/rust-quality/ffi-engineer) did NOT run this round due to ACP/native auth stall; review is Claude-only.

Model coverage note: Claude general, Claude rust-quality, and Claude ffi-engineer ran. Codex general/rust-quality/ffi-engineer did not run because ACP-Codex authentication stalled during the authenticate handshake and native Codex CLI returned 401/missing bearer; this review is therefore degraded from the usual Claude+Codex matrix.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-249: Public ShieldedSyncManager::quiesce can still lower clear_shielded's continuously-held gate (carry-forward of prior 748c4f82-1)
  `clear_shielded_inner` (manager/mod.rs:548) holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across the entire drain → liveness-check → store-wipe sequence, expecting the shielded `quiescing` boolean to stay raised so no direct `sync_now`/`sync_wallet` slips past and re-persists into the store being wiped (mod.rs docs 515-524). The hold guard at coordinator_lifecycle.rs:243-249 and the RAII guard inside `CoordinatorLifecycle::quiesce` at line 171 both wrap the SAME `Arc<AtomicBool>`, and `AtomicFlagGuard::drop` unconditionally `store(false, Release)`s. A concurrent invocation of `ShieldedSyncManager::quiesce` (which is still `pub` at shielded_sync.rs:280) would lower the gate mid-Clear, opening exactly the window the docstring promises is closed. 

  Reachability on this head: verified there is no FFI export that triggers this — `platform_wallet_manager_shielded_sync_stop` is cancel-only in this delta, `destroy` goes through the registry shutdown path whose drain hook only stores `true` with no RAII guard (lines 121-131), and `shielded_clear` uses `quiesce_under_held_gate`. The only in-tree caller of `shielded_sync().quiesce()` is the internal test at coordinator_lifecycle.rs:346. So the cross-boundary exploit path is closed by surface; the residual is a fragile in-process Rust contract that future callers can break silently. The module docs at mod.rs:515-524 already acknowledge this as a follow-up (per-key clearing latch / refcount on the registry). Suggested cheap structural fixes: narrow `ShieldedSyncManager::quiesce` to `pub(crate)` (no FFI consumer today), refcount the gate, or force every caller through `quiesce_under_held_gate` after `hold_quiescing_gate`. Non-blocking — flagged so the latent fragility does not regrow once new shielded entry points are added.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs Outdated

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Two blocking issues found at HEAD 284752a. (1) ShieldedSyncManager::start()'s spawn_periodic_loop calls reopen_quiescing_gate() before start_thread's clearing-latch check — a racing FFI shielded_sync_start during clear_shielded silently lowers the clear's continuously-held quiescing gate, partially defeating the new latch. (2) The Swift ShieldedService.clearLocalState catches clearShielded()'s ShieldedShutdownIncomplete error and continues to wipe SwiftData, violating the explicit PlatformWalletResult contract ("must NOT commit its own persistence wipe — the Rust store was left intact"). Prior carry-forward finding (public ShieldedSyncManager::quiesce can lower the held gate) is STILL VALID as a latent suggestion. Several lower-severity issues around the ClearingGuard's set-membership semantics and its test coverage are kept as suggestions/nitpicks.

🔴 2 blocking | 🟡 1 suggestion(s) | 💬 1 nitpick(s)

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:131-161: Refused shielded (re)start still lowers clear_shielded's continuously-held quiescing gate
  `spawn_periodic_loop` calls `self.reopen_quiescing_gate()` at line 137 (which `store(false, Release)`s `quiescing`) before invoking `registry.start_thread(...)` at line 142. The new per-key clearing latch correctly refuses the start inside `start_thread` (registry.rs:414-416), but by then the gate has already been lowered. `clear_shielded_inner` (manager/mod.rs:390-415) holds the gate up via `hold_quiescing_gate()` precisely to keep direct `sync_now`/`sync_wallet` callers blocked across drain → liveness-check → wipe, and a single shared `AtomicFlagGuard` always clears the flag on drop — the clear path has no way to re-raise it. So an FFI `platform_wallet_manager_shielded_sync_start` (exported at rs-platform-wallet-ffi/src/shielded_sync.rs:57) landing during a clear will: (a) call `ShieldedSyncManager::start()` → `spawn_periodic_loop` → `reopen_quiescing_gate()` setting `quiescing=false`, (b) be refused by the clearing latch at the registry, (c) leave the gate down for the remainder of the clear. A direct `sync_now`/`sync_wallet` going through `begin_pass` then observes `quiescing=false` and slips past the barrier, re-persisting into the store the clear is about to wipe — the exact race the new latch was added to close. Fix by checking the clearing latch BEFORE reopening the gate (e.g. early-return from `spawn_periodic_loop` if `registry.is_clearing(worker)`), or ordering the gate reopen so it only fires when the registry actually accepts the start (e.g. inside the closure passed to `start_thread`, or only after a non-rejecting outcome). The same hazard exists structurally for the identity-sync and platform-address coordinators if they ever gain a clearing latch.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-628: Swift clear flow wipes SwiftData rows after Rust reports the shielded store was left intact
  `platform_wallet_manager_shielded_clear` maps `PlatformWalletError::ShieldedShutdownIncomplete` to `ErrorShutdownIncomplete` and returns BEFORE touching the Rust shielded store (rs-platform-wallet-ffi/src/shielded_sync.rs:446-461). The Swift error contract states this explicitly: `PlatformWalletResult.swift:188-193` says on `shutdownIncomplete` "the host ... must NOT commit its own persistence wipe — the Rust store was left intact so it can be retried once the pass settles." But `ShieldedService.clearLocalState` catches the thrown error, logs it via `SDKLogger.error`, and falls through into `modelContext.delete(model: PersistentShieldedNote/OutgoingNote/SyncState/Activity)` at lines 624-627. If Rust refused the clear because a coordinator didn't drain in time, SwiftData rows are deleted while the Rust commitment tree remains populated — the exact cross-language desync the new FFI result code was added to prevent. The legacy `// Best-effort — failure logs but doesn't abort the wipe` comment at lines 606-607 predates this contract and is now wrong. The catch must distinguish `.shutdownIncomplete` from other errors and return early without wiping host rows.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: ClearingGuard uses set-membership semantics — nested or overlapping guards release the latch on the first drop
  `hold_clearing` inserts the key into a `BTreeSet<K>` and `ClearingGuard::drop` removes it unconditionally. With two live `ClearingGuard`s for the same key (nested call by future code, OR two concurrent FFI `platform_wallet_manager_shielded_clear` invocations on the same handle — `HandleStorage::with_item` only holds a read lock), the first guard to drop removes the set entry and the surviving guard's protection lapses silently. Today only `clear_shielded_inner` takes the latch and the test `hold_clearing_nested_guards_share_the_latch` (registry.rs:1857) documents the quirk via comment — but the public `pub fn hold_clearing` signature gives no hint to callers, and a concurrent host-thread clear is reachable through the FFI surface. Convert the backing store to `BTreeMap<K, NonZeroUsize>` so guards refcount/compose, or serialize `clear_shielded` per key at the manager boundary so re-entrant clears can't overlap.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
Comment thread packages/rs-dash-async/src/registry.rs
Comment thread packages/rs-dash-async/src/registry.rs Outdated

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta (97e373b) only adds nonisolated(unsafe) to two Swift handler properties for Swift 6 strict-concurrency compliance — that change is narrow and sound. All five prior findings remain unaddressed at HEAD, including two blocking issues: (1) spawn_periodic_loop unconditionally lowers the shared quiescing gate before the per-key clearing latch can refuse the (re)start, opening a window where a direct sync_now/sync_wallet can persist into the store being cleared, and (2) Swift ShieldedService.clearLocalState wipes SwiftData rows even when Rust returns ErrorShutdownIncomplete — the comment now explicitly codifies 'best-effort, failure logs but doesn't abort the wipe', which is the opposite of the documented FFI contract.

🔴 2 blocking | 🟡 1 suggestion(s)

1 additional finding(s) omitted (not in diff).

2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:131-161: [prior-1, carried forward] `spawn_periodic_loop` lowers the shared quiescing gate before the per-key clearing latch can refuse the (re)start
  `spawn_periodic_loop` calls `self.reopen_quiescing_gate()` at line 137 — an unconditional `quiescing.store(false, Release)` on the `Arc<AtomicBool>` — BEFORE handing off to `registry.start_thread(...)` at line 142, where the per-key `hold_clearing(ShieldedSync)` latch refuses the start. During `clear_shielded_inner`, the clear flow continuously holds both the registry clearing latch and the quiescing gate (via `hold_quiescing_gate`, which is a non-restoring RAII guard). Race: (T1) clear holds latch + gate=true; (T2) host calls `shielded_sync_start()` → `spawn_periodic_loop` → `reopen_quiescing_gate` writes `false` → `start_thread` is refused by the latch, but the gate is already down; (T3) a direct `sync_now`/`sync_wallet` enters `begin_pass`, observes `quiescing == false`, takes the `is_syncing` slot, and runs to completion — persisting into the shielded store inside the clear-then-wipe window. The `quiesce_under_held_gate` debug_assert at line 219-222 catches the inconsistency only in debug builds; release builds proceed silently. Fix: only reopen the gate once `start_thread` confirms acceptance — either consult `lock_clearing` in `spawn_periodic_loop` and skip the reopen on refusal, move the reopen inside the registry's slot critical section after the latch check, or use the registry's accept signal to gate the store.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:196-203: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
  `CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. Today the wallet's `shutdown` routes through `registry.shutdown()` rather than this method, but `ShieldedSyncManager::quiesce` remains `pub` and reachable via the `shielded_sync_arc` accessor — the trap is one accidental call away from re-opening the same lapse Clear was hardened to prevent. The structural fix is to give the gate a refcount/holder-aware re-raise (so concurrent owners compose), or to gate the public `quiesce` against the clearing latch and assert no other holder is active.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:606-628: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
  `PlatformWalletFFIResultCode::ErrorShutdownIncomplete` is the explicit signal that the Rust shielded store was left intact: the FFI doc in `packages/rs-platform-wallet-ffi/src/error.rs` states 'the manager is NOT torn down and the store was left intact (Clear aborted before touching it). The host may retry the clear, and must not commit its own persistence wipe — doing so would desync the host's rows from the still-populated shared tree.' The current implementation at lines 608-616 catches the thrown error, logs it via `SDKLogger.error`, then falls through to step 2 (lines 623-628), unconditionally deleting `PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, and `PersistentShieldedActivity`. The comment at line 606-607 even codifies this: 'Best-effort — failure logs but doesn't abort the wipe.' That is the exact split-brain the FFI contract was designed to prevent: when Rust correctly preserves notes (orphan-alive backstop, drain timeout, non-clean drain status), Swift discards its mirror, and the next `bindShielded` + sync repopulates host rows from the surviving Rust state — the user-visible 'Clear' becomes a no-op on the Rust side while host data is lost. Minimum fix: inspect the thrown error code and short-circuit the SwiftData wipe (surfacing the error to the user / scheduler) on `errorShutdownIncomplete`.

Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs
Comment thread packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried forward from prior review (still valid at bcf9388): prior-2 (blocking — Swift clearLocalState still wipes SwiftData on ErrorShutdownIncomplete), prior-3 (suggestion — public quiesce() drops AtomicFlagGuard on the shared atomic), prior-4 (suggestion — ClearingGuard uses set membership), prior-5 (nitpick — nested-guards test doesn't exercise the documented lapse). prior-1 is partially addressed: the new is_clearing short-circuit + SEC-002 regression test close the common race, but the check and the gate-write are still two distinct critical sections, leaving a narrow residual TOCTOU window — carried forward as a suggestion. No new defects identified in the latest delta (rs-platform-wallet-storage / secrets / keyless rehydration look careful: fail-closed topology check, transactional rollback, per-row skip taxonomy, structured FFI outcome).

🔴 1 blocking | 🟡 3 suggestion(s) | 💬 1 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
  The Rust FFI contract for `ErrorShutdownIncomplete` on `platform_wallet_manager_shielded_clear` (rs-platform-wallet-ffi/src/error.rs around the `ShieldedShutdownIncomplete → ErrorShutdownIncomplete` mapping, plus shielded_sync.rs:454) is explicit: when the manager surfaces this code, the Rust shielded store was left intact because Clear aborted before touching it (drain non-clean / orphan-alive backstop). The host must retry the clear and must NOT commit its own persistence wipe; doing so desyncs host rows from the still-populated shared tree. ShieldedService.swift:608-616 catches the throw, logs via `SDKLogger.error`, then falls through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the exact split-brain the FFI contract was designed to prevent. The Swift surface already distinguishes this code: `PlatformWalletResultCode.errorShutdownIncomplete` exists (PlatformWalletResult.swift:47) and decodes into `PlatformWalletError.shutdownIncomplete(detail)` (line 240), so the host can branch on it. Minimum fix: on the typed `shutdownIncomplete` case (or any thrown error from `clearShielded()`, since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user instead of proceeding.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:137-148: [prior-1, partial fix landed] Residual TOCTOU between `is_clearing` check and `reopen_quiescing_gate`
  The new `is_clearing(self.worker)` short-circuit at line 145 closes the common race (clear is already holding the latch when spawn arrives), and the SEC-002 regression test in mod.rs:1071-1108 covers that ordering. But the check (line 145) and the gate-lower (line 148) are not done under a single critical section: `is_clearing` takes `lock_clearing`, returns the boolean, releases the lock; then `reopen_quiescing_gate` writes `false` to the shared atomic. If `clear_shielded_inner` acquires the latch AND raises the quiescing gate between those two points, the reopen at 148 lowers a gate Clear believes it holds continuously across drain → liveness → wipe. In that window a direct `sync_now`/`sync_wallet` going through `begin_pass` (lines 306-341) observes `quiescing == false`, claims the `is_syncing` slot, runs the pass to completion, and persists into the store about to be wiped. `quiesce_under_held_gate`'s `debug_assert!` (line 230) catches the inconsistency only in debug builds; release silently proceeds with the drain. The window is microseconds-narrow, but the design invariant is continuous gate-hold. Structural fix: have the registry expose an atomic observe-and-act under one `lock_clearing()` scope (e.g. `with_clearing_check`) so the gate write is sequenced with the latch read, or move the gate write inside the registry slot critical section alongside the latch check.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:207-214: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
  `CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. `ShieldedSyncManager::quiesce` remains `pub` and reachable through the manager's shielded accessor — one accidental call away from re-opening the same lapse Clear was hardened to prevent. Structural fix: refcount/owner-aware re-raise (so concurrent owners compose), or gate the public `quiesce` against the clearing latch and assert no other holder is active before its guard drop is allowed to fire.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: [prior-4, carried forward] `ClearingGuard` uses set-membership — overlapping guards release the latch on the first drop
  `hold_clearing` does `lock_clearing().insert(key)` into a `BTreeSet<K>`, and `ClearingGuard::Drop` (lines 1099-1103) unconditionally removes the key. With two live guards for the same key (a nested or overlapping clear from a future caller, a test, or an accidental shielded re-clear before the first guard drops), the first drop removes the key while the second guard is still alive — silently lapsing the latch protection the still-live guard is supposed to provide. The new `is_clearing()` check (line 620) inherits the same accounting weakness because it reads the same set membership. The test at lines 1859-1883 acknowledges this in its comment but does not fix the type-level mismatch between the RAII guard's compositional appearance and its set-based semantics. Minimum fix: refcount per key (`BTreeMap<K, NonZeroUsize>`), or add a `debug_assert!` that the insert was novel so accidental nesting fails loudly in debug.

Note: GitHub refused to serve this PR's diff via gh pr diff because it exceeds the 20,000-line limit, so this exact-SHA review is posted body-only. Existing inline threads are reused rather than duplicated.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta resolves prior-4 and prior-5: ClearingGuard now refcounts via BTreeMap<K, NonZeroUsize> so nested/concurrent guards compose, and the new test hold_clearing_inner_drop_does_not_lapse_outer_protection is non-vacuous and asserts the lapse window. prior-2 (Swift ShieldedService wipe-on-error) remains unchanged and blocking. prior-3 is partially addressed by the new is_clearing bail in public quiesce(), but the same check-then-act TOCTOU pattern from prior-1 now exists at both spawn_periodic_loop (lines 145-148) and quiesce (lines 217-224); collapsed into a single structural finding.

🔴 1 blocking | 🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the FFI contract
  Unchanged at 6b655aab. The Rust FFI contract for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain/orphan-alive backstop). The host MUST NOT commit its own persistence wipe in that case, or SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is designed to prevent.

  Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it. Minimum fix: on any thrown error from `clearShielded()` (since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1 + residual prior-3] Check-then-act TOCTOU around the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce` call sites
  Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.

  1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).

  2. `quiesce` (217-224): the new fix to bail when `is_clearing` is true closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose `Drop` (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.

  Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.

Note: Posted body-only because normal inline poster failed: command failed 1: python3 scripts/review_poster.py dashpay/platform 3954 6b655aa --dry-run.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried-forward prior findings dominate this incremental review. prior-2 (Swift host wipes SwiftData rows on a failed Rust shielded clear) is STILL VALID and remains blocking — the catch at ShieldedService.swift:611-615 still logs-and-falls-through into the SwiftData delete block, contradicting the Rust FFI's fail-closed ErrorShutdownIncomplete contract. prior-1+prior-3 (TOCTOU between is_clearing and the shared quiescing atomic at both spawn_periodic_loop:145-148 and quiesce:217-224) is STILL VALID as a suggestion. The latest delta (e46249f) only adds a one-shot panic=abort startup warn on with_reap_backstop, doc back-references on AtomicFlagGuard/EpilogueGuard, and two cfg-pinned tests — verified correct, no new defects.

🔴 1 blocking | 🟡 1 suggestion(s)

Findings not posted inline (2)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract — Unchanged at e46249f. The Rust FFI for platform_wallet_manager_shielded_clear surfaces ErrorShutdownIncomplete specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe...
  • [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU between is_clearing and the shared quiescing atomic at both spawn_periodic_loop and quiesce — Unchanged at e46249f. Same structural pattern at two call sites — neither performs the is_clearing check atomically with the subsequent write to the shared quiescing atomic. 1. spawn_periodic_loop (145-148): is_clearing(self.worker) at 145 takes lock_clearing, returns the boolean, rele...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract
  Unchanged at e46249fe. The Rust FFI for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe in that case, or the SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is engineered to prevent.

  Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant for a fail-closed FFI. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it.

  Minimum fix: on any thrown error from `clearShielded()`, short-circuit the SwiftData wipe, set `lastError` to a retry-actionable message, and return so the user can retry Clear once Rust quiesces.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU between `is_clearing` and the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce`
  Unchanged at e46249fe. Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.

  1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).

  2. `quiesce` (217-224): the new `is_clearing` early-return closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose Drop (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.

  Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Both prior findings are FIXED at f424caf: the Swift fail-closed path now early-returns before the SwiftData wipe when clearShielded() throws (ShieldedService.swift:618), and the is_clearing↔quiescing TOCTOU is closed by switching to a refcounted Arc with composable RefcountedFlagGuard. One new blocking issue remains: the drain_hook's unconditional fetch_add is paired with a fetch_sub that only fires when registry.quiesce() returns a non-NotRunning status AND when cancel_join_and_drain runs to completion. Both agents independently flag this asymmetry; codex via the production-reachable timeout cancellation path that clear_shielded_inner wraps around quiesce_under_held_gate, claude via the registry's Superseded / inner-NotRunning return sites. Either path leaks a +1 into quiescing, leaving the shielded coordinator permanently gated so every future begin_pass bails.

🔴 1 blocking

Findings not posted inline (1)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinatordrain_hook unconditionally does quiescing.fetch_add(+1) (line 128) and is paired with a single conditional fetch_sub(-1) here, gated on status != WorkerStatus::NotRunning. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning sta...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinator
  `drain_hook` unconditionally does `quiescing.fetch_add(+1)` (line 128) and is paired with a single conditional `fetch_sub(-1)` here, gated on `status != WorkerStatus::NotRunning`. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning status, and the wallet-side future always runs to completion past line 285." Neither half of that invariant holds:

  1. **Cancellation / timeout path (production-reachable).** `clear_shielded_inner` (manager/mod.rs:412-419) wraps `quiesce_under_held_gate()` in `tokio::time::timeout`. A wedged shielded background pass can keep the inner future suspended past line 702 of `registry.quiesce` (drain hook has already been awaited and `fetch_add(+1)` has run) while it polls the join loop. When the outer timeout elapses, the future is dropped, so the matching `fetch_sub` at line 286 never runs. The `_clearing_gate` in `clear_shielded_inner` then drops, releasing only its own contribution. Net result: `quiescing` stays at +1 for the lifetime of this `CoordinatorLifecycle`, so every subsequent `begin_pass` bails on the `quiescing.load() > 0` check at line 378. The shielded coordinator is silently and permanently quiesced — manual `sync_now`, `sync_wallet`, and the background loop all stop doing work — even though the host application has handled the timeout and moved on.

  2. **Superseded / inner-NotRunning return paths inside `registry.quiesce`.** The early exit at registry.rs:693 (cancel and handle both absent → drain hook never fires) is safe under the current condition. The two NotRunning returns *inside* the join loop are not: `Step::Superseded` (registry.rs:760/777) and the inner `Step::NotRunning` (registry.rs:763/777) both happen *after* the hook has already fired at line 702, but return `WorkerStatus::NotRunning`, so the conditional `fetch_sub` is skipped — same leak. Today these branches are unreachable in the wallet because the `closing` / `hold_clearing` latches prevent same-key restart racing `quiesce`, but the comment at lines 277-284 documents an invariant stricter than what `registry.quiesce` actually guarantees. The PR description explicitly anticipates rs-dapi adopting `ThreadRegistry`; any future consumer that calls `quiesce()` without an external restart-prevention mutex inherits a silent gate leak.

  The fundamental issue is the asymmetric ownership of the refcount: the registry's `drain_hook` raises the gate, but the wallet has to release it from outside the registry, with no signal for "did the hook actually fire on this call." The cleanest fixes are (a) remove the `drain_hook` contribution to `quiescing` entirely — the public `quiesce()` already raises the gate manually before calling `cancel_join_and_drain` (line 236), and `quiesce_under_held_gate` requires the caller to hold one, so the hook's raise is redundant; or (b) restructure so the hook's contribution is held by a cancellation-safe guard the registry hands back to the caller via `quiesce`'s return value, so Drop on any path (including future cancellation) releases the hook's ref. Either way, the conditional decrement here is unsafe and the comment overstates the registry's contract.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta correctly resolves prior-1 by removing the registry-side drain_hook (worker_config now returns drain: None) and relying on the lifecycle's cancellation-safe RefcountedFlagGuard. The new regression test (quiesce_cancellation_during_drain_does_not_leak_quiescing_refcount) is non-vacuous. New issue introduced by the same delta: PlatformWalletManager::shutdown() goes through registry.shutdown() -> registry.quiesce(key), never through CoordinatorLifecycle::quiesce(), so the per-coordinator quiescing gate is no longer raised at all during manager shutdown. The pre-fix drain hook used to provide that barrier; both reviewers independently flag its loss. The mod.rs:455 docstring is now stale relative to the implementation, and the prior race window where a concurrent host-issued sync_now/sync_wallet can slip past begin_pass during destroy is reopened.

🔴 1 blocking

Carried-forward prior findings

  • Prior f424caf blocker is FIXED: FIXED at commit 36aa3a3 via the recommended fix (a) from the prior review: CoordinatorLifecycle::worker_config() now returns drain: None (coordinator_lifecycle.rs:111-117), so no registry-side fetch_add is ever paired with a wallet-side conditional fetch_sub. The gate is now raised exclusively by the lifecycle's own RefcountedFlagGuardquiesce()'s local guard at line 221 and the Clear flow's hold_quiescing_gate — both of whose Drop is cancellation-safe (runs on future-drop unwind). New non-vacuous regression test quiesce_cancellation_during_drain_does_not_leak_quiescing_refcount (lines 613-711) aborts a quiesce() future mid-drain while a background pass is parked in std::thread::sleep and asserts the refcount returns to 0 — it would time out against the prior drain-hook design. Both the cancellation/timeout leak path and the Superseded/inner-NotRunning paths from the prior finding are eliminated by removing the asymmetric ownership entirely. (Note: the removal also eliminates the gate during PlatformWalletManager::shutdown() — see the new blocking finding above.)

New findings in the latest delta

In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:

  • [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
    Setting worker_config().drain = None correctly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by which PlatformWalletManager::shutdown() raised the per-coordinator quiescing gate. shutdown() (manager/mod.rs:487-489) calls self.registry.shutdown().await, which iterates workers and calls self.quiesce(key) on the registry directly (registry.rs:834-866). It never goes through CoordinatorLifecycle::quiesce() (the only path that now raises the gate via the local RefcountedFlagGuard at line 221) or quiesce_under_held_gate() (the Clear flow's path). With drain: None, registry.quiesce runs no hook and the gate stays at 0 throughout the entire teardown.

Consequences:

  1. Production race during FFI destroy. The FFI bridge invokes PlatformWalletManager::shutdown(), which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issued sync_now / sync_wallet on another thread can pass begin_pass's quiescing > 0 check (line 357), claim is_syncing, run its body after the registry has cancelled/joined the background loop, and call persister.store(...) or fire a host completion callback after destroy returned and the host freed the callback context. The pre-fix drain hook's fetch_add ran inside registry.quiesce and held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place for lifecycle.quiesce() direct callers and the shielded Clear flow.

  2. Stale documented invariant. The mod.rs:455 docstring on shutdown() still claims "Each worker's drain raises its quiescing gate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.

Fix options: (a) have PlatformWalletManager::shutdown() route per-coordinator teardown through each coordinator's lifecycle.quiesce() instead of (or in addition to) registry.shutdown(), so the lifecycle-owned RefcountedFlagGuard is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the RefcountedFlagGuard to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
  Setting `worker_config().drain = None` correctly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by which `PlatformWalletManager::shutdown()` raised the per-coordinator `quiescing` gate. `shutdown()` (manager/mod.rs:487-489) calls `self.registry.shutdown().await`, which iterates workers and calls `self.quiesce(key)` on the registry directly (registry.rs:834-866). It never goes through `CoordinatorLifecycle::quiesce()` (the only path that now raises the gate via the local `RefcountedFlagGuard` at line 221) or `quiesce_under_held_gate()` (the Clear flow's path). With `drain: None`, `registry.quiesce` runs no hook and the gate stays at 0 throughout the entire teardown.

Consequences:

1. Production race during FFI `destroy`. The FFI bridge invokes `PlatformWalletManager::shutdown()`, which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issued `sync_now` / `sync_wallet` on another thread can pass `begin_pass`'s `quiescing > 0` check (line 357), claim `is_syncing`, run its body after the registry has cancelled/joined the background loop, and call `persister.store(...)` or fire a host completion callback after `destroy` returned and the host freed the callback context. The pre-fix drain hook's `fetch_add` ran inside `registry.quiesce` and held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place for `lifecycle.quiesce()` direct callers and the shielded Clear flow.

2. Stale documented invariant. The mod.rs:455 docstring on `shutdown()` still claims "Each worker's drain raises its `quiescing` gate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.

Fix options: (a) have `PlatformWalletManager::shutdown()` route per-coordinator teardown through each coordinator's `lifecycle.quiesce()` instead of (or in addition to) `registry.shutdown()`, so the lifecycle-owned `RefcountedFlagGuard` is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the `RefcountedFlagGuard` to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.

Note: Posted as a body-only review because GitHub refuses gh pr diff for this PR with PullRequest.diff too_large (>20,000 lines), so the normal inline mapper could not run.

thepastaclaw pushed a commit to thepastaclaw/platform that referenced this pull request Jun 26, 2026
…builds

PR dashpay#3954 review thread dashpay#5 (comment 3481707091) asked for runtime visibility
when the registry's reap path runs under `panic = "abort"`: an EpilogueGuard
or AtomicFlagGuard panic during teardown aborts the process instead of
unwinding, so Drop never releases the orphan-liveness gate and the registry's
join-and-orphan-liveness invariant can stay held forever in the worst case.

Stable Rust has NO runtime API to query the active panic strategy — only
`#[cfg(panic = "...")]` at compile time. iOS release builds intentionally
choose `panic = "abort"`, so a hard gate (compile-time error or refusal to
construct) is wrong. A compile-time-gated, one-shot `tracing::warn!` from
`ThreadRegistry::with_reap_backstop` is a strict improvement over the
doc-caveat-only state today: operators get a single audit-trail line at
startup pointing them at the registry.rs / atomic.rs caveats.

The warn is gated by a process-wide `std::sync::Once` so repeated registry
construction (tests, restart paths, multi-instance hosts) fires it once.
The static is hoisted to module scope (also `#[cfg(panic = "abort")]`) so
the in-crate regression test can probe `Once::is_completed()` directly
without subscribing to tracing from a `#[test]`.

Doc caveats on `EpilogueGuard` (registry.rs) and `AtomicFlagGuard`
(atomic.rs) are extended with a one-liner pointing operators at the new
constructor-emitted warn.

Tests:
- `with_reap_backstop_emits_panic_abort_warn_under_abort_builds`
  (only compiled under abort) — proves the Once trips on first call and
  stays one-shot across a second construction. Build-cfg-pinned: if the
  cfg block in `with_reap_backstop` is ever deleted, this test disappears
  with it and CI loses the signal.
- `with_reap_backstop_no_warn_under_unwind` (only compiled under unwind) —
  sentinel proving the no-op cfg branch compiles and `with_reap_backstop`
  keeps behaving like a plain constructor on the default profile.

Verification:
- `cargo test -p dash-async --all-targets` — 36 passed (was 35).
- `cargo clippy -p dash-async --all-targets -- -D warnings` — clean.
- `cargo fmt --all -- --check` — clean.
- `cargo check -p platform-wallet -p platform-wallet-ffi --all-targets` — clean.
- `RUSTFLAGS="-C panic=abort" cargo build -p dash-async --lib` — clean.
  (`--tests` requires nightly `-Zpanic_abort_tests`; lib path covers the
  new cfg branch.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Blocking finding addressed in 6c9ed0a890 — both the verification result and the structural fix.

On the asserted UAF (refuted at HEAD, but the point stands): the concurrent direct sync_now/sync_wallet vs. FFI destroy race is not actually reachable at this HEAD — the FFI HandleStorage RwLock already serializes them: every sync entrypoint runs under with_item (read lock held across block_on(sync_now)), while destroy takes the write lock via remove before block_on(shutdown()). They can never overlap. So no production UAF existed.

But the mechanical observations were correct and worth fixing: with drain: None, shutdown() no longer raised the per-coordinator quiescing gate, the manager leaned on the FFI lock rather than being self-contained, and the mod.rs docstring was stale. We took your option (a):

  • shutdown() now quiesces the weight-0 coordinators first, concurrently, each via CoordinatorLifecycle::quiesce() — whose RefcountedFlagGuard holds the gate raised across cancel → join → in-flight-pass drain. A concurrent direct sync reaching begin_pass now observes quiescing > 0 and bails, even without the FFI lock.
  • No registry drain hook reintroduced, so option (b)'s fetch_add/fetch_sub-across-await leak (the one HEAD just fixed) stays fixed. The guard's Drop returns the refcount to 0 on every exit path including await-cancellation.
  • Weight ordering preserved: the weight-10 event adapter that sinks the stores is torn down after the coordinators, via ThreadRegistry::shutdown().
  • mod.rs:455 docstring rewritten to describe the lifecycle-owned gate (not a registry drain hook) and the multi-thread-runtime precondition (which ThreadRegistry::shutdown already asserts; shutdown() now fails fast up front).
  • New non-vacuous regression test shutdown_raises_quiescing_gate_during_teardown — asserts the gate is raised mid-teardown, a concurrent sync_now does no work, and the refcount returns to 0. Verified it fails against the prior registry-only shutdown() body.

cargo fmt clean, cargo clippy -p platform-wallet --all-targets 0 warnings, cargo test -p platform-wallet 221 passed / 0 failed.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The prior shutdown-gate blocker is fixed in the current head because manager shutdown now routes coordinator teardown through each lifecycle quiesce(). However, the new implementation releases those lifecycle gates before the registry shutdown latch is set, leaving a narrow but real post-quiesce race where direct syncs or restarts can enter after the coordinator barrier but before the rest of shutdown is closed.

🔴 1 blocking

Findings not posted inline (1)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closedquiesce_coordinators().await raises each lifecycle gate only for the duration of that coordinator's quiesce(). As soon as line 516 returns, those guards have dropped, but ThreadRegistry::shutdown() has not yet set the registry closing latch until line 523 begins polling the registry futur...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closed
  `quiesce_coordinators().await` raises each lifecycle gate only for the duration of that coordinator's `quiesce()`. As soon as line 516 returns, those guards have dropped, but `ThreadRegistry::shutdown()` has not yet set the registry `closing` latch until line 523 begins polling the registry future. A concurrent Rust caller holding the manager or coordinator `Arc` can enter `sync_now()` / shielded `sync_wallet()` in that gap, see `quiescing == 0`, claim `is_syncing`, and continue persisting or firing callbacks while shutdown tears down the event adapter and returns. A concurrent `start()` can also install a fresh coordinator before `closing` is set; the registry may then correctly report that new worker's shutdown status, but the merge at lines 524-525 overwrites it with the stale status captured before the restart. Hold the coordinator teardown gates, or set an equivalent manager/registry closing barrier, across the entire registry shutdown phase before reporting that shutdown is clean.

@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Round-2 blocking finding fixed in 17e1c07901. Good catch — verified it was a real hole in the round-1 fix, not just a phantom.

Confirmed: round-1 held each coordinator's quiescing gate only for the duration of its own quiesce() (the RefcountedFlagGuard was local to lifecycle.quiesce). On quiesce_coordinators() return the gates were back at 0 while registry.shutdown() tore down the weight-10 event adapter + reaped orphans, and begin_pass checks only the per-coordinator quiescing atomic — so a direct sync_now/sync_wallet in that gap could persist while the store sink was torn down. (Unreachable via FFI — the HandleStorage RwLock serializes destroy's remove/write vs sync's with_item/read — but reachable by any downstream Rust consumer of the public API, which defeats the self-containment goal, so it's worth fixing.)

Fix: shutdown() now acquires a held RefcountedFlagGuard per coordinator via hold_coordinator_gates() before quiescing and keeps them in scope across quiesce_coordinators(), registry.shutdown(), and the report build — dropped only at function end. The refcount stays > 0 for the entire teardown (composing with each quiesce's own transient raise) and returns to 0 only when shutdown completes/unwinds (Drop is cancellation-safe). Confirmed holding the gate across registry.shutdown() doesn't deadlock — the registry never reads the per-coordinator quiescing atomic.

Restart/merge: also fixed. The per-worker merge now overwrites a coordinator's slot only when the registry classified it NotRunning, so a concurrent start()'s fresh, live worker status is preserved instead of being clobbered by the stale pre-restart status.

Test: new shutdown_holds_gate_through_registry_teardown parks a wedged orphan so the registry reap blocks ~1s, then asserts the gate is still raised in that phase, a direct sync_now() bails purely on quiescing > 0, and the refcount returns to 0 only after shutdown completes. Verified it fails against the 6c9ed0a890 body and passes after the fix.

cargo fmt clean, cargo clippy --package platform-wallet --all-targets 0 warnings, cargo test --package platform-wallet --lib 222 passed / 0 failed.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried-forward prior findings: prior-1 is FIXED. The current head raises every coordinator gate before coordinator quiesce and holds those guards through registry shutdown and status merging, and I found no new in-scope latest-delta findings. Targeted test execution could not complete because the build needed to download Tenderdash proto sources and the network escalation retry was rejected.

Note: Posted body-only because GitHub refused the PR diff (PullRequest.diff too_large, over 20,000 lines), so inline mapping could not run. The verified output contained no findings.

… join [#3954, independent on v3.1-dev]

Introduce a shared `ThreadRegistry<WalletWorker>` lifecycle engine in
`dash-async` (weight-ordered graceful shutdown: lower weights drain first,
equal weights concurrently; reap-or-park orphan handling; cancellation-safe
`RefcountedFlagGuard`). Migrate the three periodic sync coordinators
(identity / platform-address / shielded) onto a shared `CoordinatorLifecycle`
that drives them through the registry and gates passes via an
`is_syncing` / `quiescing` handshake.

`PlatformWalletManager::shutdown()` now holds every coordinator's quiescing
gate across the whole teardown, quiesces the weight-0 coordinators
concurrently, then drains the weight-10 wallet-event adapter and any parked
orphans, returning a per-worker `ShutdownReport`. `clear_shielded` refuses
(leaving the commitment-tree store intact) when the in-flight pass does not
drain cleanly, surfaced as `ShieldedShutdownIncomplete` /
`ErrorShutdownIncomplete = 19`. FFI `destroy` reports the incomplete-shutdown
code so the Swift host defers freeing its callback context.

Reconstructed independently on a clean v3.1-dev base: only the graceful-
shutdown contribution is included; the drifted #3692 rehydration/storage
work merged into the source branch is excluded.

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek force-pushed the feat/platform-wallet-shutdown-join branch from 17e1c07 to a89e0f9 Compare June 29, 2026 13:44

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

No carried-forward prior findings from the 17e1c07 review; the earlier shutdown-gate issue remains fixed. I verified one new in-scope latest-delta regression: the SQLite same-path open guard was removed while buffers remain per-persister. The destroy finding was dropped because the PR explicitly documents destroy as terminal and the Swift wrapper implements that contract by retaining callback contexts on incomplete shutdown.

🔴 2 blocking

Findings not posted inline (1)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers — The latest delta removed the process-wide canonical-path registry and the AlreadyOpen error, so open() now returns a fresh SqlitePersister with its own Buffer::new() for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot co...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet-storage/src/sqlite/persister.rs`:
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
  The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
  The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.

@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Re: thepastaclaw's 2026-06-29 re-review (persister.rs:205-208, same-path opens → divergent per-persister buffers).

This is a storage-crate concern that belongs to #3968, not this PR. #3954 (rs-dash-async ThreadRegistry + manager lifecycle) has a zero-line diff against rs-platform-wallet-storage — it does not touch persister.rs or the SQLite open path. Under the de-stacking this reads as "removed" only because the storage hardening lives in the sibling PR, not here.

The concern itself is real and already addressed in #3968: the process-wide canonical-path registry + WalletStorageError::AlreadyOpen guard is implemented at persister.rs:78-110 (is_path_open() / open() canonicalizes the path and refuses a second in-process open of the same DB), so a same-path re-open returns AlreadyOpen instead of a divergent-buffer persister.

Dispositioning as out of scope for #3954; tracked and fixed in #3968.

🤖 Co-authored by Claudius the Magnificent AI Agent

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Thanks for the correction. I rechecked current head a89e0f97daefc9363005644a77c0b1b4fa000509 and agree the same-path SQLite open finding is out of scope for #3954: this PR's diff has no rs-platform-wallet-storage files, and the storage persister guard belongs to the sibling storage PR #3968 / its branch context rather than this ThreadRegistry + wallet lifecycle PR.

I am superseding my 2026-06-29 CHANGES_REQUESTED review for #3954. I do not see any remaining in-scope blockers on the current #3954 head from the prior shutdown-gate review trail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants