chore(evals): Update model evaluations 2026-06-30 #143

Open
rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-30

Conversation

@rhacs-bot

Contributor

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-30

This PR was automatically generated by the Model Evaluation workflow.

rhacs-bot requested a review from janisz as a code owner on June 30, 2026 07:43
coderabbitai Bot commented Jun 30, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ac720ebf-c6f3-443b-a721-2e17bb122f6f

📥 Commits

Reviewing files that changed from the base of the PR and between dd48317 and 1b93cd6.

📒 Files selected for processing (1)
  • docs/model-evaluation.md

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Updated the model evaluation report with the latest results.
    • Replaced the previous dated run with a new entry showing a slightly lower overall score.
    • Refreshed task outcomes and token usage figures to reflect the latest evaluation.

Walkthrough

The gpt-5-mini evaluation section in docs/model-evaluation.md is updated from a 2026-06-16 run to a 2026-06-30 run. The overall pass rate changes from 100% to 90%, cve-multiple changes from Pass to Fail, rhsa-not-supported and cve-nonexistent change from Fail to Pass, and token counts are revised.

Changes

gpt-5-mini Evaluation Results Update

| Layer / File(s) | Summary |
| --- | --- |
| Evaluation section refresh (docs/model-evaluation.md) | Replaces the 2026-06-16 dated block with a 2026-06-30 block: overall score drops to 10/11 (90%), cve-multiple flips to Fail, rhsa-not-supported and cve-nonexistent flip to Pass, and input/output token totals are updated. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the dated model evaluation update. |
| Description check | ✅ Passed | The description matches the automated weekly model evaluation update and date. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov-commenter commented Jun 30, 2026


❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 380 | 2 | 378 | 12 |
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 54 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 54 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3
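The flake-rate figures above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the rate is the share of failed runs among all recorded runs on main; that definition and the `flake_rate` helper are illustrative assumptions, not Codecov's documented formula.

```python
def flake_rate(passed: int, failed: int) -> float:
    """Return the percentage of recorded runs that failed.

    Assumption: rate = failed / (passed + failed). This helper is
    hypothetical and only mirrors the numbers shown in the report.
    """
    runs = passed + failed
    if runs == 0:
        raise ValueError("no recorded runs")
    return 100.0 * failed / runs


# Figures reported for both tests: Passed 0 times, Failed 54 times.
rate = flake_rate(passed=0, failed=54)
print(f"{rate:.2f}%")  # prints "100.00%"
```

Under this reading, a test that fails in every one of 54 runs is failing consistently rather than intermittently, which may be worth keeping in mind when triaging the two entries.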

@github-actions


E2E Test Results

Commit: 1b93cd6
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-multiple (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)

Tasks:      10/11 passed (90.91%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~52715 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  13040 tokens
  Output: 21243 tokens
Judge used tokens:
  Input:  43999 tokens
  Output: 36148 tokens
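
The task and assertion totals in the summary above can be cross-checked by parsing the per-task lines. The following is a rough sketch, assuming each task line has the form `✓ name (assertions: a/b)` or `✗ name (assertions: a/b)`; the `summarize` function and its regular expression are hypothetical helpers, not part of the evaluation workflow.

```python
import re

# Task lines copied from the Evaluation Summary in this comment.
SUMMARY = """\
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-multiple (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
"""

# Assumed line shape: mark, task name, then "(assertions: passed/total)".
LINE_RE = re.compile(r"^\s*([✓✗])\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)")


def summarize(text: str) -> dict:
    """Count passed tasks and passed assertions from the summary text."""
    tasks = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines such as the failure reason are skipped
            mark, name, ok, total = match.groups()
            tasks.append((name, mark == "✓", int(ok), int(total)))
    return {
        "tasks_passed": sum(1 for t in tasks if t[1]),
        "tasks_total": len(tasks),
        "assertions_passed": sum(t[2] for t in tasks),
        "assertions_total": sum(t[3] for t in tasks),
        "failed": [t[0] for t in tasks if not t[1]],
    }


result = summarize(SUMMARY)
pct = 100 * result["tasks_passed"] / result["tasks_total"]
print(f"Tasks: {result['tasks_passed']}/{result['tasks_total']} ({pct:.2f}%)")
print(f"Assertions: {result['assertions_passed']}/{result['assertions_total']}")
print(f"Failed: {result['failed']}")
```

Note that cve-nonexistent is marked failed even though all three of its assertions passed, so the failure comes from a verification step rather than an assertion, which is why the task and assertion percentages differ.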

3 participants