chore(evals): Update model evaluations 2026-06-02#136

Open
rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-02

@rhacs-bot
Contributor

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-02

This PR was automatically generated by the Model Evaluation workflow.

@rhacs-bot rhacs-bot requested a review from janisz as a code owner June 2, 2026 08:06
@coderabbitai
Contributor

coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f7a679be-86f1-4d16-b869-c645ba796b6c

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce9af and ba8afd0.

📒 Files selected for processing (1)
  • docs/model-evaluation.md

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Updated model evaluation results with improved performance metrics from the latest evaluation run.

Walkthrough

This PR updates the gpt-5-mini evaluation results in the model evaluation documentation. The dated results block for 2026-05-26 (showing 10/11 tasks passed) is replaced with 2026-06-02 results (showing 11/11 tasks passed), including an updated per-task result table and token usage metrics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Evaluation Results Documentation (docs/model-evaluation.md) | Updated gpt-5-mini evaluation metrics from 2026-05-26 run (90% pass rate, 10/11 tasks) to 2026-06-02 run (100% pass rate, 11/11 tasks), including revised per-task results table and token counts. |
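
The pass rates quoted in the summary follow from the task counts. As a minimal sketch (the helper name `pass_rate` is hypothetical and not part of the repository), the arithmetic shows that 10/11 is closer to 90.91%, so the "90%" above is a truncated figure:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of tasks that passed, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 2)


previous = pass_rate(10, 11)  # 2026-05-26 run: 10 of 11 tasks passed
current = pass_rate(11, 11)   # 2026-06-02 run: 11 of 11 tasks passed
print(f"{previous:.2f}% -> {current:.2f}%")
```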

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: updating model evaluation results with a specific date (2026-06-02). |
| Description check | ✅ Passed | The description is relevant to the changeset, providing context about the automated evaluation update, models tested, and evaluation date. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codecov-commenter

codecov-commenter commented Jun 2, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 380 | 2 | 378 | 12 |
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3

@github-actions

github-actions Bot commented Jun 2, 2026

E2E Test Results

Commit: ba8afd0
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~51728 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  13292 tokens
  Output: 20004 tokens
Judge used tokens:
  Input:  48659 tokens
  Output: 39011 tokens
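
The task and assertion totals in the summary can be cross-checked from the per-task lines (ten tasks with 3 assertions each plus one with 2 gives 32). The following is a rough sketch rather than the workflow's own code: the function name `tally`, the regular expression, and the rule that a task counts as passed only when it carries the ✓ mark and all of its assertions pass are assumptions made for illustration.

```python
import re

# Per-task lines copied from the Evaluation Summary above.
SUMMARY = """\
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
"""

# Matches "<mark> <task-name> (assertions: <passed>/<total>)".
LINE_RE = re.compile(r"^\s*(\S)\s+([\w-]+)\s+\(assertions:\s*(\d+)/(\d+)\)\s*$")


def tally(summary: str) -> dict:
    """Recompute task and assertion totals from per-task summary lines."""
    totals = {
        "tasks_passed": 0,
        "tasks_total": 0,
        "assertions_passed": 0,
        "assertions_total": 0,
    }
    for line in summary.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a per-task line
        mark, _name, ok, total = match.groups()
        ok, total = int(ok), int(total)
        totals["tasks_total"] += 1
        totals["assertions_passed"] += ok
        totals["assertions_total"] += total
        # Assumption: a pass requires the check mark and every assertion passing.
        if mark == "✓" and ok == total:
            totals["tasks_passed"] += 1
    return totals


result = tally(SUMMARY)
print(f"Tasks:      {result['tasks_passed']}/{result['tasks_total']}")
print(f"Assertions: {result['assertions_passed']}/{result['assertions_total']}")
```

Run against the lines above, the recomputed totals agree with the reported 11/11 tasks and 32/32 assertions.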

3 participants