POLLUX LLM-Judge metric#1210

Open
ulyanaisaeva wants to merge 1 commit into huggingface:main from ulyanaisaeva:main

Conversation

@ulyanaisaeva

This PR adds POLLUX, a criteria-based LLM judge suitable for any generative task, with customizable criterion descriptions.

PolluxLLMJudgeMetric is exposed as a custom metric class in lighteval.metrics.metrics_sample (not registered on the built-in Metrics enum), because both the scoring scale and criterion description are defined at initialization time.

Summary

  • Adds a sample-level POLLUX metric, PolluxLLMJudgeMetric, built on JudgeLM (OpenAI-compatible HTTP API by default, plus existing LightEval judge backends).
  • The class subclasses SampleLevelComputation, not JudgeLLM, so any judge model id (e.g. Hugging Face repo ids for POLLUX checkpoints) works with the openai backend without the OpenAI-model whitelist enforced on JudgeLLM.
  • Prompt formatting: get_judge_prompt_pollux and _build_pollux_prompt_text in judge_utils.
  • Parsing: make_pollux_score_parser(pattern) and make_pollux_feedback_parser(pattern), with defaults POLLUX_DEFAULT_SCORE_RE (a bare numeric answer) and no feedback parsing (POLLUX_DEFAULT_FEEDBACK_RE is None). For tagged judge output, use POLLUX_TAGGED_SCORE_RE and POLLUX_TAGGED_FEEDBACK_RE ([RESULT]…[END] and [FEEDBACK]…[RESULT], respectively). process_judge_response_pollux / parse_pollux_feedback remain aliases for the default factories.
  • Per sample: always pollux_score. Optional pollux_feedback when include_feedback=True and a non-None feedback_pattern is supplied (otherwise feedback is the empty string). Corpus aggregation should average only pollux_score (or a custom corpus_level_fn), not free-text feedback.
  • Data mapping: question from doc.query, answer from response.final_text[0], optional reference from doc.specific["reference_answer"] passed as POLLUX gold; options are always None for this metric.
  • Docs: docs/source/metric-list.mdx (usage + patterns), [[autodoc]] for PolluxLLMJudgeMetric in docs/source/package_reference/metrics.mdx.
  • Tests: tests/unit/metrics/test_pollux_judge.py (mocked JudgeLM, no network). tests/unit/metrics/test_cases/pollux_judge.json is a deliberate placeholder with metric_class pollux_llm_judge_custom so automated JSON runs skip it until a Metrics enum entry exists.
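
To make the parser-factory design above concrete, here is a minimal sketch of how score/feedback parsers for tagged judge output could be built. The regexes and function bodies are assumptions for illustration; the actual POLLUX_TAGGED_* patterns and factory implementations live in judge_utils and may differ.

```python
import re

# Assumed patterns mirroring the [FEEDBACK]…[RESULT]…[END] format described above;
# the real POLLUX_TAGGED_SCORE_RE / POLLUX_TAGGED_FEEDBACK_RE may differ.
TAGGED_SCORE_RE = re.compile(r"\[RESULT\]\s*(\d+)\s*\[END\]", re.DOTALL)
TAGGED_FEEDBACK_RE = re.compile(r"\[FEEDBACK\](.*?)\[RESULT\]", re.DOTALL)

def make_score_parser(pattern):
    """Build a parser that pulls an integer score out of raw judge output."""
    def parse(judge_output: str):
        match = pattern.search(judge_output)
        return int(match.group(1)) if match else None
    return parse

def make_feedback_parser(pattern):
    """Build a parser that pulls free-text feedback, or '' when no pattern/match."""
    def parse(judge_output: str):
        if pattern is None:
            return ""
        match = pattern.search(judge_output)
        return match.group(1).strip() if match else ""
    return parse

parse_score = make_score_parser(TAGGED_SCORE_RE)
parse_feedback = make_feedback_parser(TAGGED_FEEDBACK_RE)

raw = "[FEEDBACK] Fluent, but misses one fact. [RESULT] 4 [END]"
print(parse_score(raw))     # 4
print(parse_feedback(raw))  # Fluent, but misses one fact.
```

The factory shape matters because the same compiled class can then accept arbitrary per-run patterns at initialization time, matching the customization story above.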

Why this metric is not on the Metrics enum

  • Metrics lists preset metrics with fixed wiring (metric_name, corpus_level_fn, …).
  • POLLUX requires per-run configuration: criterion text, rubric map, judge model, backend, URL, optional tagged patterns, etc.; there is no single default that fits all benchmarks.
  • Public API: wrap with SampleLevelMetric(..., sample_level_fn=PolluxLLMJudgeMetric(...), batched_compute=True, ...), like other custom metrics. Adding an enum member without an agreed preset would bloat the API.

(If a canonical preset appears later, we could add an enum entry and, if needed, list it in SKIPPED_METRICS for automated JSON — same idea as simpleqa_judge.)
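
As a concrete illustration of the corpus-level behavior noted in the summary (average only pollux_score, never the free-text feedback), here is a hypothetical aggregator of the kind one might pass as corpus_level_fn. The helper name and the None-skipping behavior are assumptions, not the PR's actual implementation.

```python
import statistics

def aggregate_pollux_scores(sample_outputs):
    """Hypothetical corpus-level aggregator: mean of pollux_score across
    samples, skipping unparseable (None) scores and ignoring pollux_feedback."""
    scores = [s["pollux_score"] for s in sample_outputs
              if s.get("pollux_score") is not None]
    return statistics.mean(scores) if scores else 0.0

outputs = [
    {"pollux_score": 4, "pollux_feedback": "Concise."},
    {"pollux_score": 2, "pollux_feedback": ""},
    {"pollux_score": None, "pollux_feedback": "unparseable judge output"},
]
print(aggregate_pollux_scores(outputs))  # 3.0
```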


@ulyanaisaeva ulyanaisaeva marked this pull request as ready for review April 13, 2026 08:31