From fbbe5a03b6dfe3ae082a52add3546c1b32f46e35 Mon Sep 17 00:00:00 2001
From: Yogesh Rao <yogesh-tessl@users.noreply.github.com>
Date: Wed, 13 May 2026 20:09:12 +0530
Subject: [PATCH] feat: improve report skill score from 44% to 95%
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hey @sjarmak 👋

I ran your skills through `tessl skill review` at work and found some targeted improvements. Here are the review scores for every skill, with before/after for `report`, the only skill this PR changes:

| Skill | Before | After | Change |
|-------|--------|-------|--------|
| report | 44% | 95% | **+51 pts** |
| audit | 69% | — | — |
| evaluate | 59% | — | — |
| infra | 54% | — | — |
| next | 56% | — | — |
| run | 71% | — | — |
| scaffold | 59% | — | — |
| status | 73% | — | — |
| triage | 62% | — | — |

The `report` skill had the most headroom — it was missing explicit trigger guidance and a structured workflow, which dragged its score down despite having solid domain content underneath.

<details>
<summary>Changes made to <code>report</code> skill</summary>

- **Expanded frontmatter description** with domain context (CodeScaleBench, baseline vs MCP) and an explicit "Use when..." clause with natural trigger terms (`benchmark results`, `cost breakdowns`, `config comparisons`, `MCP impact analysis`, `pass-rate summaries`)
- **Added a 4-step reporting workflow** — verify run completion → generate report → compare configs → review for anomalies — so the agent follows a sensible sequence instead of picking commands at random
- **Replaced flat command list with a decision table** mapping each reporting goal to the right command, making it easy to pick the correct script
- **Tightened scope bullets** to be more specific about what this skill handles (benchmark runs, token spending by model/suite, baseline vs MCP comparisons)
- **Preserved all domain terminology** — MCP, Sourcegraph, oracle, baseline, IR, dual verification — untouched

</details>
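
For anyone who prefers to read the new decision table as data, it amounts to a lookup from reporting goal to command. Below is a minimal Python sketch of that idea. The goal keys (`eval_summary`, `run_cost`, and so on) and the helper `command_for` are hypothetical names made up for illustration; only the command strings are copied from the table in this patch, and `<run>` is kept as the placeholder the skill uses. None of this is part of the repository.

```python
from typing import List, Optional

# Goal keys are hypothetical labels; the command strings are copied
# verbatim from the "When to Use Which Report" table in this patch.
REPORT_COMMANDS = {
    "eval_summary": "python3 scripts/maintenance/generate_eval_report.py",
    "run_cost": "python3 scripts/evaluation/cost_report.py --run-dir runs/staging/<run>",
    "cost_by_model_suite": "python3 scripts/evaluation/cost_breakdown_analysis.py --staging",
    "baseline_vs_mcp": "python3 scripts/evaluation/compare_configs.py --config1 baseline --config2 sourcegraph_full",
    "ir_quality": "python3 scripts/evaluation/ir_analysis.py --run-dir runs/staging/<run>",
    "oracle_impact": "python3 scripts/evaluation/oracle_ir_analysis.py --runs-dir runs/staging",
}


def command_for(goal: str, run: Optional[str] = None) -> List[str]:
    """Return the argv list for a reporting goal (hypothetical helper)."""
    if goal not in REPORT_COMMANDS:
        raise KeyError(f"unknown reporting goal: {goal!r}")
    template = REPORT_COMMANDS[goal]
    if "<run>" in template and not run:
        raise ValueError(f"goal {goal!r} needs a run directory name")
    # Split first, then substitute, so a run name containing spaces
    # stays inside a single argument.
    return [arg.replace("<run>", run or "") for arg in template.split()]
```

A caller could pass the result to `subprocess.run(command_for("run_cost", run="my_run"), check=True)` from the repository root, where `my_run` is a made-up directory name. Returning an argv list rather than a shell string is a design choice in this sketch that avoids shell quoting issues.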

I also stress-tested your `run` skill against a few real-world task evals and it held up really well on paired baseline+MCP execution with multi-account orchestration guardrails. Kudos for that.

Honest disclosure — I work at @tesslio where we build tooling around skills like these. Not a pitch — just saw room for improvement and wanted to contribute.

Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me — [@yogesh-tessl](https://github.com/yogesh-tessl) — if you hit any snags.

Thanks in advance 🙏
---
 skills/report/SKILL.md | 70 ++++++++++++++++++++----------------------
 1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/skills/report/SKILL.md b/skills/report/SKILL.md
index 94f0cac0c0..3575373c03 100644
--- a/skills/report/SKILL.md
+++ b/skills/report/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: report
-description: Generate evaluation reports, analyze run costs, and compare configurations.
+description: "Generate CodeScaleBench evaluation reports, analyze benchmark run costs, and compare baseline vs MCP configuration performance. Use when the user asks for benchmark results, cost breakdowns, config comparisons, MCP impact analysis, or pass-rate summaries."
 ---
 
 # Skill: Reports & Analysis
@@ -8,41 +8,39 @@ description: Generate evaluation reports, analyze run costs, and compare configu
 ## Scope
 
 Use this skill when the user asks to:
-- Generate comprehensive evaluation reports
-- Analyze run costs and spending
-- Compare configurations and model performance
-- Perform impact analysis (IR, MCP effects)
-- Export and visualize results
-
-## Canonical Commands
-
-```bash
-# Generate full evaluation report
-python3 scripts/maintenance/generate_eval_report.py
-
-# Cost breakdown analysis
-python3 scripts/evaluation/cost_report.py --run-dir runs/staging/run_dir
-
-# Cost analysis by model/suite
-python3 scripts/evaluation/cost_breakdown_analysis.py --staging
-
-# Compare configurations
-python3 scripts/evaluation/compare_configs.py --config1 baseline --config2 sourcegraph_full
-
-# IR analysis (information retrieval)
-python3 scripts/evaluation/ir_analysis.py --run-dir runs/staging/run_dir
-
-# Oracle impact analysis
-python3 scripts/evaluation/oracle_ir_analysis.py --runs-dir runs/staging
-```
-
-## Report Types
-
-- **Evaluation Report** — task-level metrics, pass rates by suite/model
-- **Cost Report** — token usage, API calls, infrastructure costs
-- **Config Comparison** — baseline vs MCP performance delta
-- **Impact Analysis** — retrieval quality and verifier contribution
-- **Trend Analysis** — performance over time across runs
+- Generate evaluation reports for benchmark runs
+- Analyze run costs and token spending by model or suite
+- Compare baseline vs MCP (Sourcegraph) configuration performance
+- Assess information retrieval (IR) and oracle impact
+- Summarize pass rates, model success, or cost-per-task metrics
+
+## Reporting Workflow
+
+1. **Verify run completion** — confirm runs finished before reporting:
+   ```bash
+   python3 scripts/analysis/aggregate_status.py --staging
+   ```
+   If completion rate is below target, use `/status` to investigate before generating reports.
+
+2. **Generate the report** — pick the command matching your goal (see "When to Use Which Report" below).
+
+3. **Compare configurations** — for A/B analysis between baseline and MCP:
+   ```bash
+   python3 scripts/evaluation/compare_configs.py --config1 baseline --config2 sourcegraph_full
+   ```
+
+4. **Review and share** — check output for anomalies (unexpected zero scores, missing suites) before sharing results.
+
+## When to Use Which Report
+
+| Goal | Command |
+|------|---------|
+| Full evaluation summary (pass rates, model scores) | `python3 scripts/maintenance/generate_eval_report.py` |
+| Cost breakdown for a single run | `python3 scripts/evaluation/cost_report.py --run-dir runs/staging/<run>` |
+| Cost comparison across models/suites | `python3 scripts/evaluation/cost_breakdown_analysis.py --staging` |
+| Baseline vs MCP performance delta | `python3 scripts/evaluation/compare_configs.py --config1 baseline --config2 sourcegraph_full` |
+| Retrieval quality (IR tasks) | `python3 scripts/evaluation/ir_analysis.py --run-dir runs/staging/<run>` |
+| Oracle and verifier contribution | `python3 scripts/evaluation/oracle_ir_analysis.py --runs-dir runs/staging` |
 
 ## Key Metrics