chore(evals): Update model evaluations 2026-06-30 by rhacs-bot · Pull Request #143 · stackrox/stackrox-mcp

rhacs-bot · 2026-06-30T07:43:04Z

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-30

This PR was automatically generated by the Model Evaluation workflow.

coderabbitai · 2026-06-30T07:44:30Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ac720ebf-c6f3-443b-a721-2e17bb122f6f

📥 Commits

Reviewing files that changed from the base of the PR and between dd48317 and 1b93cd6.

📒 Files selected for processing (1)

docs/model-evaluation.md

📝 Walkthrough

Summary by CodeRabbit

Documentation
- Updated the model evaluation report with the latest results.
- Replaced the previous dated run with a new entry showing a slightly lower overall score.
- Refreshed task outcomes and token usage figures to reflect the latest evaluation.

Walkthrough

The gpt-5-mini evaluation section in docs/model-evaluation.md is updated from a 2026-06-16 run to a 2026-06-30 run. The overall pass rate changes from 100% to 90%, cve-multiple changes from Pass to Fail, rhsa-not-supported and cve-nonexistent change from Fail to Pass, and token counts are revised.

Changes

gpt-5-mini Evaluation Results Update

| Layer / File(s) | Summary |
| --- | --- |
| Evaluation section refresh `docs/model-evaluation.md` | Replaces the `2026-06-16` dated block with a `2026-06-30` block: overall score drops to 10/11 (90%), `cve-multiple` flips to Fail, `rhsa-not-supported` and `cve-nonexistent` flip to Pass, and input/output token totals are updated. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the dated model evaluation update. |
| Description check | ✅ Passed | The description matches the automated weekly model evaluation update and date. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov-commenter · 2026-06-30T07:47:33Z

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 380 | 2 | 378 | 12 |

View the full list of 2 ❄️ flaky test(s)

::policy 1
Flake rate in main: 100.00% (Passed 0 times, Failed 54 times)
Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3

::policy 4
Flake rate in main: 100.00% (Passed 0 times, Failed 54 times)
Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3
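
The flake-rate figures above are consistent with failures divided by total recorded runs. The following is a minimal sketch under that assumption; the report does not state the exact formula, and `flake_rate` is a hypothetical helper, not part of any Codecov API.

```python
def flake_rate(passed: int, failed: int) -> float:
    """Return the share of failed runs as a percentage.

    Assumption: flake rate = failed / (passed + failed) * 100.
    This matches the numbers quoted in the report but is not
    confirmed by it.
    """
    total = passed + failed
    if total == 0:
        raise ValueError("no recorded runs")
    return 100.0 * failed / total


if __name__ == "__main__":
    # Counts quoted for both `::policy 1` and `::policy 4`.
    print(f"{flake_rate(passed=0, failed=54):.2f}%")
```

With 0 passes and 54 failures the rate is 100%, which means both tests have failed on every recorded run in main rather than failing intermittently.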


github-actions · 2026-06-30T07:53:30Z

E2E Test Results

Commit: 1b93cd6
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-multiple (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)

Tasks:      10/11 passed (90.91%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~52715 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  13040 tokens
  Output: 21243 tokens
Judge used tokens:
  Input:  43999 tokens
  Output: 36148 tokens
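
The task and assertion totals in the summary can be recomputed from the per-task lines. The sketch below assumes the line format shown above (`✓`/`✗`, task name, `assertions: passed/total`); `summarize` and `TASK_LINE` are hypothetical names for illustration and are not taken from the stackrox-mcp evaluation tooling.

```python
import re

# Per-task lines copied from the E2E report above.
SUMMARY = """
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-multiple (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
"""

# Matches: pass/fail mark, task name, then "assertions: passed/total".
TASK_LINE = re.compile(r"^\s*(✓|✗)\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)\s*$")


def summarize(text: str) -> dict:
    """Count tasks and assertions from the per-task summary lines."""
    result = {
        "tasks_passed": 0,
        "tasks_total": 0,
        "assertions_passed": 0,
        "assertions_total": 0,
        "failed_tasks": [],
    }
    for line in text.splitlines():
        match = TASK_LINE.match(line)
        if match is None:
            continue  # skip blank lines and failure detail lines
        mark, name, passed, total = match.groups()
        result["tasks_total"] += 1
        if mark == "✓":
            result["tasks_passed"] += 1
        else:
            result["failed_tasks"].append(name)
        result["assertions_passed"] += int(passed)
        result["assertions_total"] += int(total)
    return result


if __name__ == "__main__":
    s = summarize(SUMMARY)
    pct = 100.0 * s["tasks_passed"] / s["tasks_total"]
    print(f"Tasks: {s['tasks_passed']}/{s['tasks_total']} passed ({pct:.2f}%)")
    print(f"Assertions: {s['assertions_passed']}/{s['assertions_total']} passed")
    print(f"Failed: {', '.join(s['failed_tasks'])}")
```

Task status is read from the mark rather than from the assertion counts because a task can fail while all of its assertions pass: `cve-nonexistent` reports 3/3 assertions but is marked failed with "one or more verification steps failed".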

Update model evaluations 2026-06-30

1b93cd6

rhacs-bot requested a review from janisz as a code owner June 30, 2026 07:43

rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-30
