feat: promptfoo config import (promptfooconfig.yaml → EVAL.yaml)

## Summary

Add `agentv import promptfoo` command that converts `promptfooconfig.yaml` into `EVAL.yaml`, enabling users with existing promptfoo test suites to adopt AgentV without rewriting their evaluations.

## Motivation

promptfoo is the largest open-source LLM eval tool (6,400+ stars, 10M+ claimed users). Supporting its config format gives AgentV immediate access to that community and reduces adoption friction. Users shouldn't have to choose between the two: they should be able to run both tools against the same test suites.

**Research reference**: [integration-assessment-promptfoo-braintrust.md](https://github.com/agentevals/agentevals-research/blob/main/research/proposals/integration-assessment-promptfoo-braintrust.md)

## Core Assertion Type Mapping

| promptfoo assertion | AgentEvals evaluator | Notes |
|--------------------|---------------------|-------|
| `llm-rubric` | `llm_judge` (freeform mode) | Direct mapping |
| `factuality` | `llm_judge` (rubric mode, factuality prompt) | Prompt template differs |
| `g-eval` | `llm_judge` (rubric mode, CoT) | G-Eval is CoT-enhanced rubric |
| `contains` / `icontains` | `field_accuracy` (contains mode) | Case sensitivity flag |
| `equals` | `field_accuracy` (exact mode) | Direct mapping |
| `regex` | `field_accuracy` (regex mode) | Direct mapping |
| `is-json` | `field_accuracy` (json_valid mode) | Schema validation optional |
| `similar` | `field_accuracy` (semantic mode) | Embedding-based |
| `tool-call-f1` | `tool_trajectory` (any_order mode) | F1 vs match semantics differ |
| `cost` / `latency` | `execution_metrics` | Direct mapping |
| `javascript` / `python` | `code_judge` | Language flag |
| `context-faithfulness` | `llm_judge` (faithfulness prompt) | RAG-specific |
| `context-recall` | `llm_judge` (recall prompt) | RAG-specific |
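The table above can be read as a lookup from promptfoo assertion type to an evaluator and mode. The following is a minimal sketch of that lookup, not AgentV's actual implementation: the output keys (`type`, `mode`, `value`, `threshold`, `case_sensitive`, `comment`) are assumptions made for illustration, and the prompt templates for the `llm_judge` variants are left out.

```python
# Hypothetical lookup transcribed from the mapping table.
# Values are (evaluator, mode); mode is None where the table names no mode.
ASSERTION_MAP = {
    "llm-rubric": ("llm_judge", "freeform"),
    "factuality": ("llm_judge", "rubric"),
    "g-eval": ("llm_judge", "rubric"),
    "contains": ("field_accuracy", "contains"),
    "icontains": ("field_accuracy", "contains"),
    "equals": ("field_accuracy", "exact"),
    "regex": ("field_accuracy", "regex"),
    "is-json": ("field_accuracy", "json_valid"),
    "similar": ("field_accuracy", "semantic"),
    "tool-call-f1": ("tool_trajectory", "any_order"),
    "cost": ("execution_metrics", None),
    "latency": ("execution_metrics", None),
    "javascript": ("code_judge", None),
    "python": ("code_judge", None),
    "context-faithfulness": ("llm_judge", None),
    "context-recall": ("llm_judge", None),
}


def map_assertion(assertion: dict) -> dict:
    """Map one promptfoo assertion dict to an evaluator dict (assumed shape)."""
    kind = assertion["type"]
    if kind not in ASSERTION_MAP:
        # Fallback described in the acceptance criteria: keep the original
        # type in a comment so nothing is dropped silently.
        return {
            "type": "code_judge",
            "comment": f"unmapped promptfoo assertion: {kind}",
        }
    evaluator, mode = ASSERTION_MAP[kind]
    result = {"type": evaluator}
    if mode is not None:
        result["mode"] = mode
    # Carry over the comparison value or threshold when present.
    for key in ("value", "threshold"):
        if key in assertion:
            result[key] = assertion[key]
    # promptfoo's `contains` is case-sensitive; `icontains` is not.
    if kind in ("contains", "icontains"):
        result["case_sensitive"] = kind == "contains"
    return result


print(map_assertion({"type": "icontains", "value": "hello"}))
```

Keeping the table as data rather than branching logic would let `--dry-run` print the same mapping it applies.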

## What Doesn't Map Cleanly

- `assert-set` with threshold logic: see #235 (assert-set evaluator)
- `not-` prefix negation: needs a `negate: true` flag (see #271)
- Combinatorial variable expansion: keep this promptfoo-specific and document that users must pre-expand variables before importing
- Matrix evaluation (prompts × providers × tests): a different paradigm, since AgentV evaluates a single target
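Two of the items above are mechanical enough to sketch. The helpers below are hypothetical (the names `split_negation` and `expand_vars` are not part of AgentV or promptfoo). They show how a `not-` prefix could be split into a base type plus a `negate` flag, and what "pre-expanding" list-valued `vars` into one test case per combination might look like.

```python
from itertools import product


def split_negation(assertion_type: str) -> tuple:
    """Split a promptfoo type like 'not-contains' into ('contains', True)."""
    prefix = "not-"
    if assertion_type.startswith(prefix):
        return assertion_type[len(prefix):], True
    return assertion_type, False


def expand_vars(variables: dict) -> list:
    """Expand list-valued vars into one dict per combination.

    Scalar values are treated as single-element lists, so they appear
    unchanged in every expanded case.
    """
    keys = list(variables)
    value_lists = [v if isinstance(v, list) else [v] for v in variables.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


print(split_negation("not-contains"))
print(expand_vars({"language": ["French", "German"], "text": "Hello"}))
```

Expansion grows multiplicatively with each list-valued variable, which is one reason to leave it as an explicit user step rather than doing it silently during import.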

## CLI Interface

```bash
# Convert a promptfoo config to EVAL.yaml
agentv import promptfoo ./promptfooconfig.yaml

# Convert with output path
agentv import promptfoo ./promptfooconfig.yaml -o ./evals/EVAL.yaml

# Dry run — show mapping without writing
agentv import promptfoo ./promptfooconfig.yaml --dry-run
```
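For illustration only, the argument shape of the commands above can be modelled with Python's `argparse`. This is a sketch of the interface, not AgentV's CLI code: the long option `--output` and the default output path `EVAL.yaml` are assumptions, since the issue only specifies `-o` and `--dry-run`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `agentv import promptfoo <config> [-o PATH] [--dry-run]`."""
    parser = argparse.ArgumentParser(prog="agentv")
    commands = parser.add_subparsers(dest="command", required=True)

    import_cmd = commands.add_parser("import", help="import a foreign eval config")
    sources = import_cmd.add_subparsers(dest="source", required=True)

    promptfoo = sources.add_parser("promptfoo", help="convert promptfooconfig.yaml")
    promptfoo.add_argument("config", help="path to promptfooconfig.yaml")
    # Assumed long form and default; the issue only shows `-o <path>`.
    promptfoo.add_argument("-o", "--output", default="EVAL.yaml")
    promptfoo.add_argument(
        "--dry-run",
        action="store_true",
        help="show the mapping without writing files",
    )
    return parser


demo = build_parser().parse_args(
    ["import", "promptfoo", "./promptfooconfig.yaml", "--dry-run"]
)
print(demo.source, demo.config, demo.output, demo.dry_run)
```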

## Acceptance Criteria

- [ ] Parses `promptfooconfig.yaml` (YAML and JSON variants)
- [ ] Maps the top 15 most-used assertion types to AgentEvals evaluators
- [ ] Converts inline test cases (`tests:` section)
- [ ] Handles `file://` references for external test data (CSV, JSONL, YAML)
- [ ] Converts `defaultTest` to shared evaluator config
- [ ] Preserves `description` and `vars` metadata
- [ ] Unmappable assertions are converted to `code_judge` with a comment noting the original type
- [ ] `--dry-run` flag shows the mapping without writing files
- [ ] Integration tests with real promptfoo config examples
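The `file://` criterion could be approached roughly as follows. This is a sketch under stated assumptions: `load_external_tests` is a hypothetical helper, relative paths are resolved against the directory containing the config, and only CSV and JSONL are handled here because YAML would need a third-party parser.

```python
import csv
import json
from pathlib import Path


def load_external_tests(ref: str, base_dir: str) -> list:
    """Load rows from a `file://` reference (CSV and JSONL only in this sketch)."""
    scheme = "file://"
    if not ref.startswith(scheme):
        raise ValueError(f"not a file reference: {ref}")
    # Relative paths resolve against the config's directory; an absolute
    # path after the scheme replaces base_dir entirely.
    path = Path(base_dir) / ref[len(scheme):]
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines so a trailing newline does not break parsing.
            return [json.loads(line) for line in handle if line.strip()]
    raise ValueError(f"unsupported test data format: {suffix}")
```

Raising on unsupported formats, rather than skipping them, keeps a partially converted suite from looking complete.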

## Effort Estimate

3-5 days

## Design Principle

Format conversion only — no runtime dependency on promptfoo. AgentV parses the YAML itself and maps to its own types.

