HBASE-29853 Move from async profiler binary to a maven dependency#7679
HBASE-29853 Move from async profiler binary to a maven dependency#7679mnpoonia wants to merge 12 commits into
Conversation
This comment has been minimized.
This comment has been minimized.
91ff534 to
8a7bcaa
Compare
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
Async profiler is an optional dependency. The availability of a profiler on the server side can be a security risk and so some (or many) may opt not to place one there. This change does not gracefully handle the absence of the profiler nor make its deployment optional. |
8a7bcaa to
268b567
Compare
268b567 to
05919a4
Compare
Good point. Updated PR addresses both concerns: the dependency is now marked true so it is never transitively pulled into downstream modules. Users who want to bundle it can activate |
|
@apurtell FYI. Hopefully i have addressed the concerns. |
|
@virajjasani Please review |
There was a problem hiding this comment.
Almost there. There's a first use case where we could be more graceful in explaining the error.
Also consider a hbase.profiler.enabled = < true | false > configuration setting that allows operators to selectively disable the profiler even if the binaries are present in the deployment. Just a thought.
| startStopperThread(request.getDuration(), request, outputFile); | ||
|
|
||
| writeAcceptedResponse(resp, request, outputFile); | ||
| } catch (InterruptedException e) { |
There was a problem hiding this comment.
The native library won't actually be loaded until first use, which will happen in doGet, and there could be other native layer issues such as a glibc missing some versioned symbols, kernel disabled perf_event support, a seccomp policy, etc. Catching Error here, or even Throwable and then testing for instanceof InterruptedException to deal with that specific case would handle all of this. Otherwise UnsatisfiedLinkError (most likely) will bubble up to the servlet container which will emit a generic 500 error response.
There was a problem hiding this comment.
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // handled first, specifically
...
} catch (IOException | Error | RuntimeException e) {
// Catches:
// - IOException: AsyncProfiler.execute() throws IOException for invalid agent commands
// - UnsatisfiedLinkError / other Error: native lib absent or incompatible OS/kernel
// - IllegalStateException / IllegalArgumentException (RuntimeException): double-start,
// unsupported event, rejected format from the profiler API
LOG.warn("Profiler failed to start or execute", e);
writeError(resp, HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
"Profiler error: " + e.getMessage()
+ ". Check that the async-profiler native library is compatible with this OS/kernel.");
This should cover every scenario.
- UnsatisfiedLinkError → caught by Error
- glibc missing versioned symbols → UnsatisfiedLinkError or Error → caught
- kernel disabled perf_event → throws at first execute() call → caught by IOException | RuntimeException
- seccomp policy → same
- InterruptedException is handled first and explicitly re-interrupts the thread — it never reaches the broader catch
…toggle, and cleanup - Catch Error (UnsatisfiedLinkError, glibc mismatch, perf_event disabled, seccomp) in ProfileServlet.doGet so native load failures return a descriptive 500 instead of bubbling to the servlet container as a generic error - Add hbase.profiler.enabled config (default true) to HttpServer so operators can disable the /prof endpoint via config even when the async-profiler binary is present - Cache ProfilerBackend.detect() result in a static field so isAvailable() and the default constructor share one detection call instead of each triggering reflective class-loading at startup - Reduce getAsyncProfilerHome() from public to package-private - Add TestHttpServer#testProfilerDisabledByConfig to verify the new config flag Co-authored-by: Claude <noreply@anthropic.com>
cbf1ae2 to
e55e963
Compare
…inaryBackend, and flaky test - profiling flag: use thisRequestSetProfiling to guard the finally-block reset, preventing 409 responses from clearing a running session's flag - ProfilerCommandMapper.toCliCommand: route through toFormatString() so SVG is remapped to html for BinaryBackend (was hard-erroring on 4.x) - ProfileServlet.createOutputFile: use toFormatString() for extension so SVG->HTML remap is reflected in the filename - TestProfileServlet: fix flaky concurrent test (duration 1s -> 3600s), remove unused CountDownLatch dead code and stale comment - TestProfilerCommandMapper: add testCliCommandSvgRemappedToHtml Co-authored-by: Claude Code <claude@anthropic.com>
… Javadoc gaps - ProfileServlet stopper: pad error message to >100 bytes so ProfileOutputServlet's size check treats it as complete and stops auto-refreshing the browser on stop failure - DisabledServlet: accept a REASON_PARAM init-param so HttpServer can distinguish "disabled by config" from "library not available"; update HttpServer to pass the reason via ServletHolder init-param - hbase-default.xml: document the security trade-off (default true auto-exposes /prof as privileged; set false on security-sensitive deployments) - ProfilerBackend.detect() + ProfileServlet.isAvailable(): add Javadoc noting detection is one-shot at class-load, and that native incompatibility surfaces at request time not at startup Co-authored-by: Claude Code <claude@anthropic.com>
…, and duplicate SVG warn - doGet: catch IOException alongside Error|RuntimeException — AsyncProfiler.execute() throws IOException for invalid agent commands; it was escaping as a raw container 500 - toLibraryStopCommand: drop 'collapsed' from the extension-derived set and always emit it as a bare token — .collapsed is not auto-detected by async-profiler 4.x detectOutputFormat; without the token, collapsed output silently fell back to plain text - ProfilerCommandMapper: split toFormatString (emits SVG deprecation LOG.warn) from toFileExtension (silent) — createOutputFile now uses toFileExtension to avoid a duplicate warning per SVG request - TestProfileServlet: reduce concurrent-test duration from 3600s to 60s to avoid long- lived daemon threads accumulating across test runs - TestProfilerCommandMapper: add testLibraryStopCommandCollapsed to lock in bare-token behavior Co-authored-by: Claude Code <claude@anthropic.com>
|
@virajjasani @apurtell A gentle reminder. |
…inaryBackend, and flaky test - duration: clamp to [1, MAX_DURATION_SECONDS] — sleep(-1) threw IllegalArgumentException leaving the profiler started but never stopped - orphan output file: move Files.write(empty) to after executeStart succeeds so a failed start does not leave a permanent empty file in OUTPUT_DIR - cross-pid rejection: use SC_BAD_REQUEST (400) instead of SC_INTERNAL_SERVER_ERROR (500) — the request is malformed, not a server-side error - ProfileRequest: add @InterfaceAudience.Private — inner classes do not inherit the enclosing class annotation - TestProfileServlet: reduce concurrent-test duration from 60s to 5s; fix brittle assertEquals(0.5, ...) to use delta; add testDurationClampedToMinOne Co-authored-by: Claude Code <claude@anthropic.com>
… and ?last tests - Stopper thread: widen catch(Exception) to catch(Throwable) and move padded-error write to finally so InterruptedException and Error (e.g. UnsatisfiedLinkError at stop time) both terminate the output file correctly; ProfileOutputServlet now references shared PROF_OUTPUT_MIN_BYTES constant instead of a magic 100. - ProfilerBackend.detect(): add ExceptionInInitializerError and a catch(Throwable) backstop so a broken async-profiler installation falls back to DisabledServlet instead of crashing DETECTED_BACKEND static init and making ProfileServlet permanently unloadable. - BinaryBackend.executeStart(): replace ProcessUtils.getPid() fallback with ProcessHandle.current().pid() (Java 9+, always correct, no JVM_PID env-var dependency). - doGet: write placeholder file before calling executeStart so a Files.write failure after a successful start does not leave the profiler running with profiling=false and no stopper thread. - writeAcceptedResponse: receive pre-computed relativeUrl instead of recomputing it from outputFile, eliminating the divergence risk between the Refresh header and the ?last redirect URL. - Tests: add ?last unit tests (404 when no result, redirect to cached URL, single-result overwrite semantics).
- Hoist outputFile declaration before try block in doGet so the catch block can reference it; delete the placeholder when !stopperStarted (covers Files.write throw, executeStart throw, and the rare OOM on t.start() where executeStart succeeded but the stopper never owned the file) - Add @beforeeach clearLastResult() so static lastResult from one test cannot bleed into another; reduce conflict-test duration from 5s to 1s to shrink the race window against the stopper thread - Add testDoGetDeletesOrphanFileWhenExecuteStartThrows: captures the exact output file via ArgumentCaptor and asserts it is deleted after executeStart throws - Add testStopperThreadWritesPaddedErrorOnExecuteStopFailure: verifies the stopper writes a padded (>PROF_OUTPUT_MIN_BYTES) error file when executeStop throws, preventing infinite browser auto-refresh
…rrupt path - Skip writeError when resp.isCommitted() to avoid calling setStatus(500) on a response already partially sent (e.g. writeAcceptedResponse threw after writing the 202 status line); the stopper thread still runs to completion in that case so profiling completes correctly on the server - Add testStopperThreadWritesPaddedErrorOnInterrupt: finds the stopper thread by name while it is sleeping a 60s duration, interrupts it, and asserts the output file grows to > PROF_OUTPUT_MIN_BYTES containing "interrupted", verifying ProfileOutputServlet will stop auto-refreshing
… hardening Correctness (C-group): - Make profilerLock and profiling static — both gate the JVM-global AsyncProfiler singleton and must be shared across servlet instances - Add best-effort executeStop in doGet catch when executeStart succeeded but startStopperThread failed — prevents LibraryBackend staying in "started" state - Add best-effort executeStop in stopper InterruptedException path (C3) - Hoist profiling=false to top of stopper finally so Files.write Error cannot skip it (C4) - Pad output file after successful executeStop if size < PROF_OUTPUT_MIN_BYTES (C5) - BinaryBackend.executeStop: replace waitFor() with waitFor(duration+30s,SECONDS) + destroyForcibly on timeout to prevent hung asprof bricking the endpoint (C6) - init() logs WARN instead of throwing ServletException on unwritable tmpdir (C7) - ?last checks Files.exists before redirecting; clears stale pointer and returns 404 if the output file was removed by OS cleanup (C8) Security (S-group): - Add AdminAuthorizedFilter to /prof-output-hbase ServletContextHandler when authenticationEnabled — prevents unauthenticated reads of profiling output (S1) - ensureOutputDir() creates OUTPUT_DIR with 0700 POSIX perms (falls back on non-POSIX); placeholder write uses CREATE_NEW (O_CREAT|O_EXCL) to reject pre-planted symlinks instead of following them (S2) Input validation: - Reject pid <= 0 with HTTP 400 in doGet before the lock, covering both backends Test hardening (T-group): - @AfterEach joins any live ProfileServlet-stopper thread so lastResult is fully settled before the next test's @beforeeach reset runs (T1) - Conflict test mocks executeStop to block on CountDownLatch released after the second doGet call — eliminates the timing race on slow CI runners (T2) - Add testDoGetRejectsNonPositivePidWith400 covering pid=-1, 0, -999 Minor: - Lock timeout 3s -> 100ms; lock-busy path returns 409 not 500 - ProfilerBackend.detect logs WARN in swallowed UnsatisfiedLinkError branch - LibraryBackend.destroy logs LOG.debug on swallowed stop exception - 202 body notes output=svg->html remap when SVG was requested
…add NPE regression test Fix NPE in AdminAuthorizedFilter.init() when the prof-output-hbase ServletContextHandler is missing CONF_CONTEXT_ATTRIBUTE and ADMINS_ACL. Call setContextAttributes(genCtx, conf) unconditionally so the filter can read conf and acl; add a test that GETs a planted profile file with auth disabled to assert 200 (not 500).
Summary
Migrates async-profiler from an optional external binary (
ASYNC_PROFILER_HOME) to an optional Maven dependency (-Pasync-profiler), eliminating the need to manually install async-profiler on each node.Architecture
Introduces a
ProfilerBackendabstraction with two implementations:LibraryBackend— in-process viaAsyncProfiler.getInstance()(used when-Pasync-profilerbuilds the Maven dep onto the classpath). Profiles only the current JVM.BinaryBackend— externalasprof/profiler.shprocess (used whenASYNC_PROFILER_HOMEis set). Supports cross-process profiling via?pid=<pid>.LibraryBackendis preferred when both are available. Detection runs once at class-load time — a restart is required if the library is added later.Behavior changes from master
?pid=is no longer required. LibraryBackend usesProcessHandle.current().pid(); BinaryBackend usesProcessHandle.current().pid()as the fallback (Java 9+, always correct — removed the legacyProcessUtils.getPid()/JVM_PIDenv-var path that could resolve to a stale or incorrect value).?pid=<external>rejected for LibraryBackend: returns HTTP 400. LibraryBackend can only profile the current JVM. BinaryBackend still accepts external PIDs.output=svgremapped tohtml: SVG was dropped in async-profiler 2.0; a deprecation warning is logged.bufsize/width/heightignored by LibraryBackend: not recognized by async-profiler 4.x agent grammar; a note is included in the 202 response body if supplied. Still honored by BinaryBackend CLI.?last— view last result:GET /prof?lastredirects to the most recently completed profiling output without starting a new session. The last result is cached in memory (survives servlet reloads within the same JVM).Build / upgrade notes
mvn install) has no behavior change —ASYNC_PROFILER_HOMEpath still works.-Pasync-profilerto build with the Maven dependency. When present,/profis enabled by default (hbase.profiler.enabled=true). Set tofalseto opt out.perf_event_paranoid, seccomp) is not checked at startup — incompatible environments surface as HTTP 500 on first use with a descriptive message.-Pasync-profiler,/profis now enabled by default. Sethbase.profiler.enabled=falseto opt out.ProcessHandle.current().pid()requires JDK 9+. master targets JDK 17 so this is fine. Backports to JDK 8 branches are not supported.Correctness / robustness
Files.write(empty placeholder)runs beforeexecuteStart. IfexecuteStartthrows (or the stopper thread fails to start), the placeholder is deleted from the catch block — guarded by!stopperStarted && outputFile != nullto cover all three orphan paths (write throws, executeStart throws, t.start() OOM).RuntimeException/ErrorfromexecuteStopandInterruptedExceptionare caught; the output file is padded to> PROF_OUTPUT_MIN_BYTESsoProfileOutputServletstops auto-refreshing instead of looping forever.writeAcceptedResponsethrows after writing the 202 status line (e.g., broken pipe), the catch block skipswriteErrorwhenresp.isCommitted()to avoid silently swallowed status codes.detect()safety: widened catch to includeExceptionInInitializerError(not a subclass ofUnsatisfiedLinkError) and acatch(Throwable)backstop, so a broken async-profiler JAR at class-load time falls back toDisabledServletrather than crashingDETECTED_BACKENDpermanently.Known limitations
profiling=trueheld for full duration even on early asprof crash (pre-existing behavior, not introduced by this PR): the stopper thread sleeps fordurationSecondsbefore callingexecuteStop, so ifasprofexits immediately (bad event name, permission denied), the/profendpoint remains blocked for the configured duration. LibraryBackend is not affected. Tracked as a follow-up.profiler.mdxwebsite docs are stale (tracked in follow-up JIRA).Tests
Added unit tests for: concurrency guard (409), error paths (
UnsatisfiedLinkError,RuntimeException), enum fallbacks (unknown output/event), duration clamping,DisabledServletreason messages, collapsed bare token, SVG remap, library command grammar,?lastredirect, last result cache, orphan file deletion onexecuteStartfailure, stopper padded-error onexecuteStopfailure, stopper padded-error on interrupt.