Skip to content

opt: parallelize FRI fold with rayon#448

Closed
MauroToscano wants to merge 7 commits into
mainfrom
opt/11-parallel-fri-fold
Closed

opt: parallelize FRI fold with rayon#448
MauroToscano wants to merge 7 commits into
mainfrom
opt/11-parallel-fri-fold

Conversation

@MauroToscano

Copy link
Copy Markdown
Contributor

Summary

Parallelize fold_evaluations_in_place in the FRI commit phase using rayon::par_iter. The fold loop was previously sequential over N/2 extension field elements. Since the in-place fold has aliasing (evals[j] reads from evals[2*j]), the parallel version uses a temporary buffer + clone_from_slice.

Benchmark (Apple Silicon M3 Pro, PARALLEL_TABLES=1, fib_iterative_1M)

Metric Before After Delta
Prove time (median, 3 samples) 62.6s 60.3s -3.7%
CV - 0.4%
Heap 23,300 MB 23,336 MB +0.2%
Verification - PASS

The temp buffer for the parallel version is ~36MB (N/2 extension elements) — small vs the 23GB working set.

Test plan

  • cargo test --release -p stark (all passed)
  • Proof verified against baseline verifier binary
  • /bench on CI runner

@github-actions

Copy link
Copy Markdown

Codex Code Review

  1. Medium - Potential data races / cross-request contamination in instrumentation state

  2. Medium - Hot-path performance regression in FRI fold

    • The parallel version of fold_evaluations_in_place now allocates a new Vec and then copies it back on every fold layer.
    • FRI folding is on the proving hot path and called repeatedly; repeated alloc+copy can materially increase runtime and allocator pressure on large proofs.
    • Reference: crypto/stark/src/fri/fri_functions.rs:68, crypto/stark/src/fri/fri_functions.rs:79.
    • Action: reuse a scratch buffer across rounds or use a two-buffer strategy to avoid per-round allocations.

No direct security vulnerabilities (unsafe/memory-safety/crypto correctness/VM privilege issues) stood out in the changed diff.

@claude

claude Bot commented Mar 18, 2026

Copy link
Copy Markdown

Review: opt: parallelize FRI fold with rayon

FRI Fold Parallelization is correct. Using a temporary buffer to avoid aliasing (reading evals[2*j] while writing evals[j]) is the right approach. Both sequential and parallel paths produce identical results.

Issues Found:

[Low] Dead code in bench script
stddev() is defined in scripts/bench_prove.sh but never called. Only cv_pct() is used. Remove it.

[Low] Lossy as u64 cast on nanosecond accumulator
In crypto/stark/src/instruments.rs, as_nanos() returns u128 but is cast to u64 in accum_r1_main/accum_r1_aux. Benign in practice but consider .min(u64::MAX as u128) as u64 to be explicit.

[Low] Misleading variable name r4_merkle_dur
This variable captures time for commit_phase_from_evaluations (the full FRI commit phase), not just Merkle hashing. It maps correctly to the fri_commit field in TableSubOps, but the intermediate name is confusing. Consider r4_fri_commit_dur.

[Informational] reset_all() only clears main-thread TLS
reset_all() resets the four global atomics but only clears thread-local slots on the calling thread. Rayon worker threads retain stale TLS across successive multi_prove calls. Safe in the current design since store always precedes take within the same rayon closure, but a panic inside prove_rounds_2_to_4 would leave stale data on that worker. The unwrap_or_default() guard on take_round_sub_ops() silently zeroes it -- worth a comment explaining this invariant.

No correctness or security issues.

@MauroToscano

Copy link
Copy Markdown
Contributor Author

/bench

@diegokingston diegokingston force-pushed the opt/11-parallel-fri-fold branch from 76dd9cc to ddaca1a Compare March 20, 2026 14:07
The FRI fold loop was sequential over half the evaluation points.
Parallelize with rayon par_iter for the first few layers where
domain_size is large enough to benefit.
@diegokingston diegokingston force-pushed the opt/11-parallel-fri-fold branch from 55b1a19 to 373be35 Compare March 20, 2026 14:15
@diegokingston

Copy link
Copy Markdown
Collaborator

/bench

@github-actions

github-actions Bot commented Mar 20, 2026

Copy link
Copy Markdown

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 32 (auto = cores / 3)

Metric main PR Δ
Peak heap 67626 MB 66991 MB -635 MB (-0.9%) ⚪
Prove time 35.276s 35.686s +0.410s (+1.2%) ⚪

✅ No significant change.

⚠️ Heap spread: 9.1% (72145 MB / 66074 MB / 66991 MB)
Consider re-running /bench

Commit: f4fbb09 · Baseline: cached · Runner: self-hosted bench

@nicole-graus

Copy link
Copy Markdown
Collaborator

/bench

@nicole-graus

Copy link
Copy Markdown
Collaborator

/bench

1 similar comment
@nicole-graus

Copy link
Copy Markdown
Collaborator

/bench

@MauroToscano

Copy link
Copy Markdown
Contributor Author

/bench 3 1

@MauroToscano

Copy link
Copy Markdown
Contributor Author

/bench 3 1

@MauroToscano

Copy link
Copy Markdown
Contributor Author

/bench

@MauroToscano

Copy link
Copy Markdown
Contributor Author

Closing this — superseded by #597, which is the newer take on the same FRI fold parallelization and adds a size threshold to avoid Rayon overhead on the small final layers. #597 still needs a more thorough review (rebase onto current main, fmt fix, and a real /bench on the runner), but it's the version we'll carry forward rather than this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants