Optimize Kafka topic filtering for large clusters (370x+ speedup)#1866

Open
andpol wants to merge 1 commit into
kafbat:mainfrom
andpol:issues/1776

@andpol andpol commented May 29, 2026

  • Breaking change? (if so, please describe the impact and migration path for existing application instances)

What changes did you make? (Give an overview)

Fixes #1776

topicStateMap called filterTopic once per topic for each offsets/stats map, and each call did an O(P_total) scan, for a total of O(T * P_total) per scrape (T topics, P_total partitions across the cluster). On large clusters this was the CPU hotspot behind the slow UI.

Fix: group each cluster-wide map by topic once (O(P_total) total), then do O(1) lookups in the per-topic loop.
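
The grouping step can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the nested `TopicPartition` record is a hypothetical stand-in for Kafka's `org.apache.kafka.common.TopicPartition` so the example compiles on its own, and the class name and sample data are made up.

```java
import java.util.HashMap;
import java.util.Map;

class GroupByTopicSketch {

  // Hypothetical stand-in for org.apache.kafka.common.TopicPartition,
  // so the sketch compiles without the Kafka client on the classpath.
  record TopicPartition(String topic, int partition) {}

  // One pass over the cluster-wide map: O(P_total) work in total.
  static <T> Map<String, Map<Integer, T>> groupByTopic(Map<TopicPartition, T> byPartition) {
    Map<String, Map<Integer, T>> grouped = new HashMap<>();
    byPartition.forEach((tp, value) ->
        grouped.computeIfAbsent(tp.topic(), t -> new HashMap<>())
            .put(tp.partition(), value));
    return grouped;
  }

  public static void main(String[] args) {
    // Made-up end offsets for two topics.
    Map<TopicPartition, Long> endOffsets = Map.of(
        new TopicPartition("orders", 0), 120L,
        new TopicPartition("orders", 1), 95L,
        new TopicPartition("payments", 0), 7L);

    Map<String, Map<Integer, Long>> grouped = groupByTopic(endOffsets);

    // Each per-topic lookup is now a hash lookup instead of a full scan.
    System.out.println(grouped.getOrDefault("orders", Map.of()));
    // A topic with no entries falls back to an empty map.
    System.out.println(grouped.getOrDefault("unknown", Map.of()));
  }
}
```

The grouped map is built once per cluster-wide input and then reused for every topic in the loop, which is where the saving comes from.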

Measured speedup (partitions = 10 * topics, median per call):
1K topics: 373ms -> 1ms (~370x)
3K topics: 3.2s -> 4ms (~800x)
10K topics: 61s -> 14ms (~4400x)
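
That the speedup grows with cluster size fits the complexity argument. As a rough model only, the sketch below counts map-entry visits for a single cluster-wide map under the benchmark's stated ratio (partitions = 10 * topics). The class and method names are illustrative, and the counts model the amount of work; they do not reproduce the measured timings, which also depend on constant factors, allocation, and caching.

```java
class ScanCostSketch {

  // Entry visits when every topic rescans the whole cluster-wide map: T * P_total.
  static long perTopicScanVisits(long topics, long partitionsPerTopic) {
    long totalPartitions = topics * partitionsPerTopic;
    return topics * totalPartitions;
  }

  // Entry visits when the map is grouped once (P_total), plus one lookup per topic (T).
  static long groupOnceVisits(long topics, long partitionsPerTopic) {
    long totalPartitions = topics * partitionsPerTopic;
    return totalPartitions + topics;
  }

  public static void main(String[] args) {
    long[] topicCounts = {1_000L, 3_000L, 10_000L};
    for (long topics : topicCounts) {
      long before = perTopicScanVisits(topics, 10);
      long after = groupOnceVisits(topics, 10);
      System.out.printf("%d topics: %d visits before, %d after%n", topics, before, after);
    }
  }
}
```

In this model the ratio of the two counts is 10T / 11, so it grows linearly with the number of topics, which matches the direction of the measured trend.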

Is there anything you'd like reviewers to focus on?

Review correctness of changes, and effectiveness of tests, as I'm not very familiar with Kafka UI code.

How Has This Been Tested? (put an "x" (case-sensitive!) next to an item)

  • No need to
  • Manually (please, describe, if necessary) - I deployed the change to our staging environment. The 1.5.0 release of Kafka UI is very obviously much slower at opening the topics page (and other things). I also ran the attached, but not committed, benchmark ScrapedClusterStatePerfTest.java (place it in api/src/test/java/io/kafbat/ui/service/metrics/scrape/ScrapedClusterStatePerfTest.java if you want to run it). See results above.
  • Unit checks - new tests to validate correctness of touched functions. They pass before my changes, and also after.
  • Integration checks
  • Covered by existing automation

Checklist (put an "x" (case-sensitive!) next to all the items, otherwise the build will fail)

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. ENVIRONMENT VARIABLES)
  • [x2] My changes generate no new warnings (e.g. Sonar is happy)
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged

Check out Contributing and Code of Conduct

A picture of a cute animal (not mandatory but encouraged)

image

Summary by CodeRabbit

  • Refactor

    • Optimized internal metrics data processing logic for improved efficiency
  • Tests

    • Expanded test coverage for metrics scraping and state validation to ensure reliability

@andpol andpol requested a review from a team as a code owner May 29, 2026 21:24
@kapybro kapybro Bot added status/triage/manual Manual triage in progress and removed status/triage/manual Manual triage in progress labels May 29, 2026
kapybro Bot commented May 29, 2026

AI Summary

This pull request addresses a performance bottleneck in Kafka UI: topicStateMap performed an O(P_total) scan for each topic, O(T * P_total) in total per scrape, causing slow UI responses on large Kafka clusters. The fix groups each cluster-wide map by topic once (O(P_total) total) and then uses O(1) lookups, resulting in significant speedups (e.g., ~370x for 1K topics, ~4400x for 10K topics). The change was tested manually in staging and with unit tests, and reviewers are asked to verify correctness and test effectiveness.

coderabbitai Bot commented May 29, 2026

Walkthrough

ScrapedClusterState refactors how it builds per-topic state maps by replacing per-topic filtering and Optional-wrapped handling with a new groupByTopic helper that converts TopicPartition → value maps into nested topic → partition → value structures. The topicStateMap method visibility changes to package-private, the unused Optional import is removed, and test coverage validates the refactored behavior end-to-end.

Changes

ScrapedClusterState refactoring with groupByTopic helper

| Layer / File(s) | Summary |
| --- | --- |
| groupByTopic helper foundation: `api/src/main/java/io/kafbat/ui/service/metrics/scrape/ScrapedClusterState.java` | New `groupByTopic(Map<TopicPartition, T>)` generic utility converts TopicPartition-keyed maps into a nested topic -> partition -> value structure for reuse across offset and stats lookups. |
| topicStateMap refactoring and cleanup: `api/src/main/java/io/kafbat/ui/service/metrics/scrape/ScrapedClusterState.java` | `topicStateMap(...)` visibility changes to package-private static, the unused `java.util.Optional` import is removed, and the method implementation is refactored to use grouped lookup maps via `groupByTopic` instead of per-topic filtering and Optional-wrapped partition stats. |
| Test coverage for refactored topicStateMap: `api/src/test/java/io/kafbat/ui/service/metrics/scrape/ScrapedClusterStateTest.java` | New test `topicStateMapGroupsOffsetsAndStatsPerTopic` builds synthetic partition/topic stats and metadata, invokes the refactored `topicStateMap`, and validates that the returned topic-state map contains the expected keys with correct descriptions, configs, start/end offsets, segment stats, and partition-to-segment mappings. Adds reflection-based test helpers to construct `InternalLogDirStats` and a `TopicDescription` factory. |
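
The property such a test guards, that the grouped lookup returns the same per-topic data as the older per-topic filter, can be sketched as below. This is an illustration under assumptions and not the test from the PR: the `filterTopic` signature is a guess at the older helper's shape, `TopicPartition` is a hypothetical stand-in record, and `sameResults` and the sample data are invented for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopicLookupEquivalenceSketch {

  // Hypothetical stand-in for org.apache.kafka.common.TopicPartition.
  record TopicPartition(String topic, int partition) {}

  // Assumed shape of the older per-topic filter: scans every entry on each call.
  static <T> Map<Integer, T> filterTopic(String topic, Map<TopicPartition, T> byPartition) {
    return byPartition.entrySet().stream()
        .filter(e -> e.getKey().topic().equals(topic))
        .collect(Collectors.toMap(e -> e.getKey().partition(), Map.Entry::getValue));
  }

  // Grouping helper: a single pass that builds topic -> partition -> value.
  static <T> Map<String, Map<Integer, T>> groupByTopic(Map<TopicPartition, T> byPartition) {
    Map<String, Map<Integer, T>> grouped = new HashMap<>();
    byPartition.forEach((tp, value) ->
        grouped.computeIfAbsent(tp.topic(), t -> new HashMap<>())
            .put(tp.partition(), value));
    return grouped;
  }

  // True when both strategies return the same partition map for every topic,
  // including topics that have no entries at all (empty map in both cases).
  static <T> boolean sameResults(List<String> topics, Map<TopicPartition, T> byPartition) {
    Map<String, Map<Integer, T>> grouped = groupByTopic(byPartition);
    return topics.stream().allMatch(topic ->
        filterTopic(topic, byPartition).equals(grouped.getOrDefault(topic, Map.of())));
  }

  public static void main(String[] args) {
    Map<TopicPartition, Long> endOffsets = Map.of(
        new TopicPartition("orders", 0), 120L,
        new TopicPartition("orders", 1), 95L,
        new TopicPartition("payments", 0), 7L);
    System.out.println(sameResults(List.of("orders", "payments", "no-data"), endOffsets));
  }
}
```

Falling back to an empty map for topics without entries is one way to replace Optional-wrapped handling; whether the PR uses exactly this fallback is not stated in the walkthrough.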

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

type/enhancement, scope/backend, status/triage/completed, area/internal

Poem

A rabbit refactors with care,
groupByTopic groups everywhere,
No Optionals linger—
Just nested map fingers,
Per-topic lookups now bright and fair. 🐰✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The PR addresses only a micro-optimization in topic lookup but does not implement the core requirements from issue #1776: making metadata/metrics fetching non-blocking, decoupling from request threads, throttling/staggering requests, or providing configurable tuning knobs. | Implement the primary objectives from #1776: decouple blocking fetch operations from request-handling threads, add async/reactive patterns or dedicated worker pools, throttle/paginate metadata requests, and provide configurable tuning parameters for large clusters. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the topic lookup optimization: refactoring ScrapedClusterState to use groupByTopic instead of filterTopic, and corresponding test updates. No unrelated changes detected. |
| Title check | ✅ Passed | The title accurately describes the main optimization refactoring: replacing per-topic filtering loops with grouped map lookups for Kafka topic state processing. |

@kapybro kapybro Bot changed the title Improve perf w/ large Kafka clusters Speed up UI by optimizing topic lookup in large Kafka clusters May 29, 2026

@github-actions github-actions Bot left a comment

Hi andpol! 👋

Welcome, and thank you for opening your first PR in the repo!

Please wait for triaging by our maintainers.

Please take a look at our contributing guide.

@kapybro kapybro Bot added status/triage/manual Manual triage in progress and removed status/triage/manual Manual triage in progress labels May 31, 2026
@kapybro kapybro Bot changed the title Speed up UI by optimizing topic lookup in large Kafka clusters Optimize Kafka topic filtering for large clusters (370x+ speedup) May 31, 2026
@kapybro kapybro Bot added area/topics impact/changelog A PR with changes which should be addressed in the changelog explicitly scope/backend Related to backend changes type/enhancement An enhancement/improvement to an already existing feature labels May 31, 2026
@kafbat kafbat deleted a comment from kapybro Bot May 31, 2026
@Haarolean Haarolean added this to the 1.6 milestone May 31, 2026
@Haarolean Haarolean requested a review from germanosin June 16, 2026 14:04
@Haarolean Haarolean moved this from Todo to In Review in Release 1.6 Jun 16, 2026
Labels

area/topics impact/changelog A PR with changes which should be addressed in the changelog explicitly scope/backend Related to backend changes type/enhancement An enhancement/improvement to an already existing feature

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

UI is unresponsive when fetching (meta)data from large Kafka cluster

2 participants