chore(evals): Update model evaluations 2026-06-02 by rhacs-bot · Pull Request #136 · stackrox/stackrox-mcp

rhacs-bot · 2026-06-02T08:06:00Z

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-02

This PR was automatically generated by the Model Evaluation workflow.

coderabbitai · 2026-06-02T08:06:12Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f7a679be-86f1-4d16-b869-c645ba796b6c

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce9af and ba8afd0.

📒 Files selected for processing (1)

docs/model-evaluation.md

📝 Walkthrough

Summary by CodeRabbit

Documentation
- Updated model evaluation results with improved performance metrics from the latest evaluation run.

Walkthrough

This PR updates the gpt-5-mini evaluation results in the model evaluation documentation. The dated results block for 2026-05-26 (showing 10/11 tasks passed) is replaced with 2026-06-02 results (showing 11/11 tasks passed), including an updated per-task result table and token usage metrics.

Changes

Cohort / File(s)	Summary
Evaluation Results Documentation `docs/model-evaluation.md`	Updated gpt-5-mini evaluation metrics from the 2026-05-26 run (about 91% pass rate, 10/11 tasks) to the 2026-06-02 run (100% pass rate, 11/11 tasks), including the revised per-task results table and token counts.
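
The pass rates quoted for the two runs follow from the task counts alone. A minimal sketch of that arithmetic, where the helper name `pass_rate` is hypothetical and not part of the repository:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of tasks passed, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 2)


previous = pass_rate(10, 11)  # 2026-05-26 run: 10 of 11 tasks passed
latest = pass_rate(11, 11)    # 2026-06-02 run: 11 of 11 tasks passed
print(f"{previous:.2f}% -> {latest:.2f}%")  # prints "90.91% -> 100.00%"
```

Rounded to a whole number, 10/11 is closer to 91% than to 90%.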

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: updating model evaluation results with a specific date (2026-06-02).
Description check	✅ Passed	The description is relevant to the changeset, providing context about the automated evaluation update, models tested, and evaluation date.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.



codecov-commenter · 2026-06-02T08:10:36Z

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
380	2	378	12

View the full list of 2 ❄️ flaky test(s)

::policy 1
Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)
Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3

::policy 4
Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)
Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3
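
The 100.00% flake rate reported for both tests is consistent with dividing failures by recorded runs on main. A rough sketch under that assumption; the function name `flake_rate` is hypothetical, and Codecov's actual definition may differ:

```python
def flake_rate(passed: int, failed: int) -> float:
    """Share of recorded runs that failed, as a percentage.

    Assumption: rate = failed / (passed + failed). This mirrors the numbers
    in the report but is not taken from Codecov's documentation.
    """
    total = passed + failed
    if total == 0:
        return 0.0
    return round(100.0 * failed / total, 2)


# Both "::policy 1" and "::policy 4" report: Passed 0 times, Failed 46 times.
print(f"{flake_rate(passed=0, failed=46):.2f}%")  # prints "100.00%"
```

Under that reading, both tests failed on every recorded run on main, which suggests the two failures predate this PR rather than being introduced by it.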


github-actions · 2026-06-02T08:16:03Z

E2E Test Results

Commit: ba8afd0
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~51728 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  13292 tokens
  Output: 20004 tokens
Judge used tokens:
  Input:  48659 tokens
  Output: 39011 tokens
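
The task and assertion totals in the summary can be recomputed from the per-task lines. The sketch below is one way to do that, not the evaluation workflow's own code: the regular expression, the `summarize` helper, and the rule that any marker other than `✓` counts as a failed task are all assumptions made for illustration.

```python
import re

# Per-task lines copied from the evaluation summary above.
SUMMARY = """\
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
"""

# One marker character, the task name, then "(assertions: passed/total)".
LINE_RE = re.compile(
    r"^\s*(\S)\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)\s*$", re.MULTILINE
)


def summarize(text: str) -> tuple[int, int, int, int]:
    """Return (tasks_passed, tasks_total, assertions_passed, assertions_total)."""
    tasks_passed = tasks_total = asserts_passed = asserts_total = 0
    for marker, _name, passed, total in LINE_RE.findall(text):
        tasks_total += 1
        # Assumption: only the check mark denotes a passed task.
        tasks_passed += marker == "✓"
        asserts_passed += int(passed)
        asserts_total += int(total)
    return tasks_passed, tasks_total, asserts_passed, asserts_total


tp, tt, ap, at = summarize(SUMMARY)
print(f"Tasks:      {tp}/{tt} passed ({100 * tp / tt:.2f}%)")
print(f"Assertions: {ap}/{at} passed ({100 * ap / at:.2f}%)")
```

Ten tasks with three assertions each plus one task with two gives the 32 assertions reported above.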

Update model evaluations 2026-06-02

ba8afd0

rhacs-bot requested a review from janisz as a code owner June 2, 2026 08:06

rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-02
