Skip to content

HBASE-29853 Move from async profiler binary to a maven dependency#7679

Open
mnpoonia wants to merge 12 commits into
apache:masterfrom
mnpoonia:hbase_async_profiler
Open

HBASE-29853 Move from async profiler binary to a maven dependency#7679
mnpoonia wants to merge 12 commits into
apache:masterfrom
mnpoonia:hbase_async_profiler

Conversation

@mnpoonia

@mnpoonia mnpoonia commented Jan 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Migrates async-profiler from an optional external binary (ASYNC_PROFILER_HOME) to an optional Maven dependency (-Pasync-profiler), eliminating the need to manually install async-profiler on each node.

Architecture

Introduces a ProfilerBackend abstraction with two implementations:

  • LibraryBackend — in-process via AsyncProfiler.getInstance() (used when -Pasync-profiler builds the Maven dep onto the classpath). Profiles only the current JVM.
  • BinaryBackend — external asprof/profiler.sh process (used when ASYNC_PROFILER_HOME is set). Supports cross-process profiling via ?pid=<pid>.

LibraryBackend is preferred when both are available. Detection runs once at class-load time — a restart is required if the library is added later.

Behavior changes from master

  • PID auto-detected: ?pid= is no longer required. LibraryBackend uses ProcessHandle.current().pid(); BinaryBackend uses ProcessHandle.current().pid() as the fallback (Java 9+, always correct — removed the legacy ProcessUtils.getPid() / JVM_PID env-var path that could resolve to a stale or incorrect value).
  • ?pid=<external> rejected for LibraryBackend: returns HTTP 400. LibraryBackend can only profile the current JVM. BinaryBackend still accepts external PIDs.
  • Profiling is single-flight: only one session at a time regardless of target PID. A second request while profiling is active returns HTTP 409 Conflict, including a link to the last completed result. Closing the browser tab does not cancel the session — the stopper thread runs to completion on the server. A cancel endpoint does not exist (known limitation, tracked separately).
  • output=svg remapped to html: SVG was dropped in async-profiler 2.0; a deprecation warning is logged.
  • bufsize/width/height ignored by LibraryBackend: not recognized by async-profiler 4.x agent grammar; a note is included in the 202 response body if supplied. Still honored by BinaryBackend CLI.
  • ?last — view last result: GET /prof?last redirects to the most recently completed profiling output without starting a new session. The last result is cached in memory (survives servlet reloads within the same JVM).

Build / upgrade notes

  • Default build (mvn install) has no behavior changeASYNC_PROFILER_HOME path still works.
  • Add -Pasync-profiler to build with the Maven dependency. When present, /prof is enabled by default (hbase.profiler.enabled=true). Set to false to opt out.
  • Native library compatibility (glibc version, kernel perf_event_paranoid, seccomp) is not checked at startup — incompatible environments surface as HTTP 500 on first use with a descriptive message.
  • If you build with -Pasync-profiler, /prof is now enabled by default. Set hbase.profiler.enabled=false to opt out.
  • ProcessHandle.current().pid() requires JDK 9+. master targets JDK 17 so this is fine. Backports to JDK 8 branches are not supported.

Correctness / robustness

  • Orphan placeholder cleanup: Files.write(empty placeholder) runs before executeStart. If executeStart throws (or the stopper thread fails to start), the placeholder is deleted from the catch block — guarded by !stopperStarted && outputFile != null to cover all three orphan paths (write throws, executeStart throws, t.start() OOM).
  • Stopper thread error paths: both RuntimeException/Error from executeStop and InterruptedException are caught; the output file is padded to > PROF_OUTPUT_MIN_BYTES so ProfileOutputServlet stops auto-refreshing instead of looping forever.
  • Committed-response guard: if writeAcceptedResponse throws after writing the 202 status line (e.g., broken pipe), the catch block skips writeError when resp.isCommitted() to avoid silently swallowed status codes.
  • detect() safety: widened catch to include ExceptionInInitializerError (not a subclass of UnsatisfiedLinkError) and a catch(Throwable) backstop, so a broken async-profiler JAR at class-load time falls back to DisabledServlet rather than crashing DETECTED_BACKEND permanently.

Known limitations

  • No cancel endpoint: a running profiling session holds the single lock for its full duration. Closing the browser tab does not stop it.
  • BinaryBackend: profiling=true held for full duration even on early asprof crash (pre-existing behavior, not introduced by this PR): the stopper thread sleeps for durationSeconds before calling executeStop, so if asprof exits immediately (bad event name, permission denied), the /prof endpoint remains blocked for the configured duration. LibraryBackend is not affected. Tracked as a follow-up.
  • profiler.mdx website docs are stale (tracked in follow-up JIRA).

Tests

Added unit tests for: concurrency guard (409), error paths (UnsatisfiedLinkError, RuntimeException), enum fallbacks (unknown output/event), duration clamping, DisabledServlet reason messages, collapsed bare token, SVG remap, library command grammar, ?last redirect, last result cache, orphan file deletion on executeStart failure, stopper padded-error on executeStop failure, stopper padded-error on interrupt.

@Apache-HBase

This comment has been minimized.

@mnpoonia mnpoonia force-pushed the hbase_async_profiler branch from 91ff534 to 8a7bcaa Compare January 26, 2026 16:32
@Apache-HBase

Copy link
Copy Markdown

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 39s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+0 🆗 mvndep 0m 25s Maven dependency ordering for branch
+1 💚 mvninstall 4m 21s master passed
+1 💚 compile 11m 54s master passed
+1 💚 checkstyle 2m 29s master passed
+1 💚 spotbugs 11m 26s master passed
+1 💚 spotless 1m 12s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 19s Maven dependency ordering for patch
+1 💚 mvninstall 3m 48s the patch passed
+1 💚 compile 13m 25s the patch passed
+1 💚 javac 13m 25s root generated 0 new + 1886 unchanged - 2 fixed = 1886 total (was 1888)
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 3m 0s root: The patch generated 0 new + 0 unchanged - 1 fixed = 0 total (was 1)
+1 💚 xmllint 0m 0s No new issues.
+1 💚 spotbugs 11m 30s the patch passed
+1 💚 hadoopcheck 18m 11s Patch does not cause any errors with Hadoop 3.3.6 3.4.1.
+1 💚 spotless 1m 3s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 39s The patch does not generate ASF License warnings.
94m 23s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7679/2/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7679
Optional Tests dupname asflicense javac codespell detsecrets xmllint hadoopcheck spotless compile spotbugs checkstyle hbaseanti
uname Linux 93f23ab627a1 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 8a7bcaa
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 194 (vs. ulimit of 30000)
modules C: hbase-resource-bundle hbase-http . U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7679/2/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3 xmllint=20913
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

Copy link
Copy Markdown

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 37s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 23s Maven dependency ordering for branch
+1 💚 mvninstall 4m 42s master passed
+1 💚 compile 3m 35s master passed
+1 💚 javadoc 3m 12s master passed
+1 💚 shadedjars 7m 58s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 30s Maven dependency ordering for patch
+1 💚 mvninstall 4m 32s the patch passed
+1 💚 compile 3m 26s the patch passed
+1 💚 javac 3m 26s the patch passed
+1 💚 javadoc 3m 49s the patch passed
+1 💚 shadedjars 7m 49s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
-1 ❌ unit 263m 57s /patch-unit-root.txt root in the patch failed.
310m 49s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7679/2/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7679
Optional Tests javac javadoc unit shadedjars compile
uname Linux 9f19bff6798d 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 8a7bcaa
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7679/2/testReport/
Max. process+thread count 4615 (vs. ulimit of 30000)
modules C: hbase-resource-bundle hbase-http . U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7679/2/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@apurtell

Copy link
Copy Markdown
Contributor

Async profiler is an optional dependency. The availability of a profiler on the server side can be a security risk and so some (or many) may opt not to place one there. This change does not gracefully handle the absence of the profiler nor make its deployment optional.

@mnpoonia mnpoonia force-pushed the hbase_async_profiler branch from 8a7bcaa to 268b567 Compare May 3, 2026 10:45
@mnpoonia mnpoonia force-pushed the hbase_async_profiler branch from 268b567 to 05919a4 Compare May 3, 2026 11:00
@mnpoonia

mnpoonia commented May 3, 2026

Copy link
Copy Markdown
Contributor Author

Async profiler is an optional dependency. The availability of a profiler on the server side can be a security risk and so some (or many) may opt not to place one there. This change does not gracefully handle the absence of the profiler nor make its deployment optional.

Good point. Updated PR addresses both concerns: the dependency is now marked true so it is never transitively pulled into downstream modules. Users who want to bundle it can activate
the async-profiler Maven profile (-Pasync-profiler). Absence is handled gracefully — HttpServer checks ProfileServlet.isAvailable() at startup and falls back to DisabledServlet (returning HTTP 500
with a clear message) when neither the JAR nor ASYNC_PROFILER_HOME is present.

@mnpoonia

mnpoonia commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

@apurtell FYI. Hopefully i have addressed the concerns.

@mnpoonia

Copy link
Copy Markdown
Contributor Author

@virajjasani Please review

@apurtell apurtell left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there. There's a first use case where we could be more graceful in explaining the error.

Also consider a hbase.profiler.enabled = < true | false > configuration setting that allows operators to selectively disable the profiler even if the binaries are present in the deployment. Just a thought.

startStopperThread(request.getDuration(), request, outputFile);

writeAcceptedResponse(resp, request, outputFile);
} catch (InterruptedException e) {

@apurtell apurtell May 20, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The native library won't actually be loaded until first use, which will happen in doGet, and there could be other native layer issues such as a glibc missing some versioned symbols, kernel disabled perf_event support, a seccomp policy, etc. Catching Error here, or even Throwable and then testing for instanceof InterruptedException to deal with that specific case would handle all of this. Otherwise UnsatisfiedLinkError (most likely) will bubble up to the servlet container which will emit a generic 500 error response.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} catch (InterruptedException e) {
    Thread.currentThread().interrupt();     // handled first, specifically
    ...
} catch (IOException | Error | RuntimeException e) {
    // Catches:
    // - IOException: AsyncProfiler.execute() throws IOException for invalid agent commands
    // - UnsatisfiedLinkError / other Error: native lib absent or incompatible OS/kernel
    // - IllegalStateException / IllegalArgumentException (RuntimeException): double-start,
    //   unsupported event, rejected format from the profiler API
    LOG.warn("Profiler failed to start or execute", e);
    writeError(resp, HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
        "Profiler error: " + e.getMessage()
          + ". Check that the async-profiler native library is compatible with this OS/kernel.");

This should cover every scenario.

- UnsatisfiedLinkError → caught by Error
- glibc missing versioned symbols → UnsatisfiedLinkError or Error → caught
- kernel disabled perf_event → throws at first execute() call → caught by IOException | RuntimeException
- seccomp policy → same
- InterruptedException is handled first and explicitly re-interrupts the thread — it never reaches the broader catch

…toggle, and cleanup

- Catch Error (UnsatisfiedLinkError, glibc mismatch, perf_event disabled, seccomp)
  in ProfileServlet.doGet so native load failures return a descriptive 500 instead
  of bubbling to the servlet container as a generic error
- Add hbase.profiler.enabled config (default true) to HttpServer so operators can
  disable the /prof endpoint via config even when the async-profiler binary is present
- Cache ProfilerBackend.detect() result in a static field so isAvailable() and the
  default constructor share one detection call instead of each triggering reflective
  class-loading at startup
- Reduce getAsyncProfilerHome() from public to package-private
- Add TestHttpServer#testProfilerDisabledByConfig to verify the new config flag

Co-authored-by: Claude <noreply@anthropic.com>
@mnpoonia mnpoonia force-pushed the hbase_async_profiler branch from cbf1ae2 to e55e963 Compare June 9, 2026 10:47
mnpoonia and others added 3 commits June 10, 2026 16:43
…inaryBackend, and flaky test

- profiling flag: use thisRequestSetProfiling to guard the finally-block
  reset, preventing 409 responses from clearing a running session's flag
- ProfilerCommandMapper.toCliCommand: route through toFormatString() so
  SVG is remapped to html for BinaryBackend (was hard-erroring on 4.x)
- ProfileServlet.createOutputFile: use toFormatString() for extension so
  SVG->HTML remap is reflected in the filename
- TestProfileServlet: fix flaky concurrent test (duration 1s -> 3600s),
  remove unused CountDownLatch dead code and stale comment
- TestProfilerCommandMapper: add testCliCommandSvgRemappedToHtml

Co-authored-by: Claude Code <claude@anthropic.com>
… Javadoc gaps

- ProfileServlet stopper: pad error message to >100 bytes so
  ProfileOutputServlet's size check treats it as complete and stops
  auto-refreshing the browser on stop failure
- DisabledServlet: accept a REASON_PARAM init-param so HttpServer can
  distinguish "disabled by config" from "library not available"; update
  HttpServer to pass the reason via ServletHolder init-param
- hbase-default.xml: document the security trade-off (default true
  auto-exposes /prof as privileged; set false on security-sensitive
  deployments)
- ProfilerBackend.detect() + ProfileServlet.isAvailable(): add Javadoc
  noting detection is one-shot at class-load, and that native
  incompatibility surfaces at request time not at startup

Co-authored-by: Claude Code <claude@anthropic.com>
…, and duplicate SVG warn

- doGet: catch IOException alongside Error|RuntimeException — AsyncProfiler.execute()
  throws IOException for invalid agent commands; it was escaping as a raw container 500
- toLibraryStopCommand: drop 'collapsed' from the extension-derived set and always emit
  it as a bare token — .collapsed is not auto-detected by async-profiler 4.x
  detectOutputFormat; without the token, collapsed output silently fell back to plain text
- ProfilerCommandMapper: split toFormatString (emits SVG deprecation LOG.warn) from
  toFileExtension (silent) — createOutputFile now uses toFileExtension to avoid a
  duplicate warning per SVG request
- TestProfileServlet: reduce concurrent-test duration from 3600s to 60s to avoid long-
  lived daemon threads accumulating across test runs
- TestProfilerCommandMapper: add testLibraryStopCommandCollapsed to lock in bare-token behavior

Co-authored-by: Claude Code <claude@anthropic.com>
@mnpoonia

Copy link
Copy Markdown
Contributor Author

@virajjasani @apurtell A gentle reminder.

mnpoonia and others added 6 commits June 12, 2026 14:42
…inaryBackend, and flaky test

- duration: clamp to [1, MAX_DURATION_SECONDS] — sleep(-1) threw
  IllegalArgumentException leaving the profiler started but never stopped
- orphan output file: move Files.write(empty) to after executeStart succeeds
  so a failed start does not leave a permanent empty file in OUTPUT_DIR
- cross-pid rejection: use SC_BAD_REQUEST (400) instead of SC_INTERNAL_SERVER_ERROR
  (500) — the request is malformed, not a server-side error
- ProfileRequest: add @InterfaceAudience.Private — inner classes do not inherit
  the enclosing class annotation
- TestProfileServlet: reduce concurrent-test duration from 60s to 5s; fix
  brittle assertEquals(0.5, ...) to use delta; add testDurationClampedToMinOne

Co-authored-by: Claude Code <claude@anthropic.com>
… and ?last tests

- Stopper thread: widen catch(Exception) to catch(Throwable) and move
  padded-error write to finally so InterruptedException and Error (e.g.
  UnsatisfiedLinkError at stop time) both terminate the output file
  correctly; ProfileOutputServlet now references shared
  PROF_OUTPUT_MIN_BYTES constant instead of a magic 100.

- ProfilerBackend.detect(): add ExceptionInInitializerError and a
  catch(Throwable) backstop so a broken async-profiler installation
  falls back to DisabledServlet instead of crashing DETECTED_BACKEND
  static init and making ProfileServlet permanently unloadable.

- BinaryBackend.executeStart(): replace ProcessUtils.getPid() fallback
  with ProcessHandle.current().pid() (Java 9+, always correct, no
  JVM_PID env-var dependency).

- doGet: write placeholder file before calling executeStart so a
  Files.write failure after a successful start does not leave the
  profiler running with profiling=false and no stopper thread.

- writeAcceptedResponse: receive pre-computed relativeUrl instead of
  recomputing it from outputFile, eliminating the divergence risk
  between the Refresh header and the ?last redirect URL.

- Tests: add ?last unit tests (404 when no result, redirect to cached
  URL, single-result overwrite semantics).
- Hoist outputFile declaration before try block in doGet so the catch
  block can reference it; delete the placeholder when !stopperStarted
  (covers Files.write throw, executeStart throw, and the rare OOM on
  t.start() where executeStart succeeded but the stopper never owned
  the file)
- Add @beforeeach clearLastResult() so static lastResult from one test
  cannot bleed into another; reduce conflict-test duration from 5s to
  1s to shrink the race window against the stopper thread
- Add testDoGetDeletesOrphanFileWhenExecuteStartThrows: captures the
  exact output file via ArgumentCaptor and asserts it is deleted after
  executeStart throws
- Add testStopperThreadWritesPaddedErrorOnExecuteStopFailure: verifies
  the stopper writes a padded (>PROF_OUTPUT_MIN_BYTES) error file when
  executeStop throws, preventing infinite browser auto-refresh
…rrupt path

- Skip writeError when resp.isCommitted() to avoid calling setStatus(500)
  on a response already partially sent (e.g. writeAcceptedResponse threw
  after writing the 202 status line); the stopper thread still runs to
  completion in that case so profiling completes correctly on the server
- Add testStopperThreadWritesPaddedErrorOnInterrupt: finds the stopper
  thread by name while it is sleeping a 60s duration, interrupts it, and
  asserts the output file grows to > PROF_OUTPUT_MIN_BYTES containing
  "interrupted", verifying ProfileOutputServlet will stop auto-refreshing
… hardening

Correctness (C-group):
- Make profilerLock and profiling static — both gate the JVM-global AsyncProfiler
  singleton and must be shared across servlet instances
- Add best-effort executeStop in doGet catch when executeStart succeeded but
  startStopperThread failed — prevents LibraryBackend staying in "started" state
- Add best-effort executeStop in stopper InterruptedException path (C3)
- Hoist profiling=false to top of stopper finally so Files.write Error cannot skip it (C4)
- Pad output file after successful executeStop if size < PROF_OUTPUT_MIN_BYTES (C5)
- BinaryBackend.executeStop: replace waitFor() with waitFor(duration+30s,SECONDS)
  + destroyForcibly on timeout to prevent hung asprof bricking the endpoint (C6)
- init() logs WARN instead of throwing ServletException on unwritable tmpdir (C7)
- ?last checks Files.exists before redirecting; clears stale pointer and returns
  404 if the output file was removed by OS cleanup (C8)

Security (S-group):
- Add AdminAuthorizedFilter to /prof-output-hbase ServletContextHandler when
  authenticationEnabled — prevents unauthenticated reads of profiling output (S1)
- ensureOutputDir() creates OUTPUT_DIR with 0700 POSIX perms (falls back on
  non-POSIX); placeholder write uses CREATE_NEW (O_CREAT|O_EXCL) to reject
  pre-planted symlinks instead of following them (S2)

Input validation:
- Reject pid <= 0 with HTTP 400 in doGet before the lock, covering both backends

Test hardening (T-group):
- @AfterEach joins any live ProfileServlet-stopper thread so lastResult is fully
  settled before the next test's @beforeeach reset runs (T1)
- Conflict test mocks executeStop to block on CountDownLatch released after the
  second doGet call — eliminates the timing race on slow CI runners (T2)
- Add testDoGetRejectsNonPositivePidWith400 covering pid=-1, 0, -999

Minor:
- Lock timeout 3s -> 100ms; lock-busy path returns 409 not 500
- ProfilerBackend.detect logs WARN in swallowed UnsatisfiedLinkError branch
- LibraryBackend.destroy logs LOG.debug on swallowed stop exception
- 202 body notes output=svg->html remap when SVG was requested
…add NPE regression test

Fix NPE in AdminAuthorizedFilter.init() when the prof-output-hbase
ServletContextHandler is missing CONF_CONTEXT_ATTRIBUTE and ADMINS_ACL.
Call setContextAttributes(genCtx, conf) unconditionally so the filter can
read conf and acl; add a test that GETs a planted profile file with auth
disabled to assert 200 (not 500).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants