-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Summary
Sprint tracker for the result schema v2 work centered on #280 and its coupled issues. The schema v2 is the foundation — it defines the data contract that OTel export and evaluator enhancements consume.
Sprint Plan
Phase 1: Foundation (do first)
- feat: result schema v2 — rename fields, add input, nest debug requests #280 — Result schema v2: rename fields, add
input, nest debug requests (2–3 days)
Phase 2: Build on new schema (after #280 lands)
- feat: OpenTelemetry trace export (OTLP/HTTP) #277 — OpenTelemetry trace export / OTLP/HTTP (3–5 days, consumes new schema directly)
Phase 3: Parallel evaluator enhancements (independent, can run alongside Phase 1–2)
- feat: evaluator negation flag (negate: true) #273 — Evaluator negation flag
negate: true(1–2 days) - feat: threshold aggregation strategy for composite evaluator #274 — Threshold aggregation strategy for composite evaluator (1–2 days)
Independent (any time, no code dependencies)
- docs: agent evaluation layer taxonomy (Reasoning / Action / E2E / Safety) #278 — Agent eval layer taxonomy docs
- docs: autoevals (Braintrust) as code_judge scorer library #279 — Autoevals/Braintrust scorer docs
Deferred (next sprint, after schema stabilizes)
- feat: shared JSONL/CSV dataset compatibility with promptfoo #275 — Shared JSONL/CSV dataset compatibility with promptfoo
- feat: promptfoo config import (promptfooconfig.yaml → EVAL.yaml) #271 — promptfoo config import
- feat: promptfoo config export (EVAL.yaml → promptfooconfig.yaml) #272 — promptfoo config export
- tracking: promptfoo integration #276 — promptfoo integration tracker
Sequencing Rationale
- feat: result schema v2 — rename fields, add input, nest debug requests #280 is the foundation — renames
EvaluatorResult→AssertionResult,OutputMessage→Message,output_messages→output, addsinputfield. Every downstream issue consumes these types. - feat: evaluator negation flag (negate: true) #273/feat: threshold aggregation strategy for composite evaluator #274 are safe in parallel — they add evaluator features that don't touch the renamed result schema fields. Trivial merge conflict on the type rename.
- promptfoo deferred — feat: shared JSONL/CSV dataset compatibility with promptfoo #275, feat: promptfoo config import (promptfooconfig.yaml → EVAL.yaml) #271, feat: promptfoo config export (EVAL.yaml → promptfooconfig.yaml) #272, tracking: promptfoo integration #276 depend on the finalized schema but are lower priority this sprint. Queue for next sprint.
Total Effort Estimate
~8–12 days of work across sprint issues, with phases 1–3 overlapping for a sprint duration of ~5–7 days.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels