POLLUX LLM-Judge metric#1210

Open
ulyanaisaeva wants to merge 1 commit into huggingface:main from ulyanaisaeva:main

Conversation

@ulyanaisaeva

This PR adds POLLUX, a criteria-based LLM judge suitable for any generative task, with customizable criterion descriptions.

PolluxLLMJudgeMetric is exposed as a custom metric class in lighteval.metrics.metrics_sample (not registered on the built-in Metrics enum), because both the scoring scale and criterion description are defined at initialization time.

Summary

  • Adds a sample-level POLLUX metric, PolluxLLMJudgeMetric, built on JudgeLM (OpenAI-compatible HTTP API by default, plus existing LightEval judge backends).
  • The class subclasses SampleLevelComputation, not JudgeLLM, so any judge model id (e.g. Hugging Face repo ids for POLLUX checkpoints) works with the openai backend without the OpenAI-model whitelist enforced on JudgeLLM.
  • Prompt formatting: get_judge_prompt_pollux and _build_pollux_prompt_text in judge_utils.
  • Parsing: make_pollux_score_parser(pattern) and make_pollux_feedback_parser(pattern), with defaults POLLUX_DEFAULT_SCORE_RE (a bare numeric answer) and no feedback parsing (POLLUX_DEFAULT_FEEDBACK_RE is None). For tagged judge output, use POLLUX_TAGGED_SCORE_RE and POLLUX_TAGGED_FEEDBACK_RE ([RESULT]…[END] and [FEEDBACK]…[RESULT], respectively). process_judge_response_pollux / parse_pollux_feedback remain aliases for the default factories.
  • Per sample: always pollux_score. Optional pollux_feedback when include_feedback=True and a non-None feedback_pattern is supplied (otherwise feedback is the empty string). Corpus aggregation should average only pollux_score (or a custom corpus_level_fn), not free-text feedback.
  • Data mapping: question from doc.query, answer from response.final_text[0], optional reference from doc.specific["reference_answer"] passed as POLLUX gold; options are always None for this metric.
  • Docs: docs/source/metric-list.mdx (usage + patterns), [[autodoc]] for PolluxLLMJudgeMetric in docs/source/package_reference/metrics.mdx.
  • Tests: tests/unit/metrics/test_pollux_judge.py (mocked JudgeLM, no network). tests/unit/metrics/test_cases/pollux_judge.json is a deliberate placeholder with metric_class pollux_llm_judge_custom so automated JSON runs skip it until a Metrics enum entry exists.
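
To make the parser-factory design above concrete, here is a minimal sketch of how score/feedback parsers for tagged judge output could be built. The regexes and function bodies are assumptions for illustration; the actual POLLUX_TAGGED_* patterns and factory implementations live in judge_utils and may differ.

```python
import re

# Assumed patterns mirroring the [FEEDBACK]…[RESULT]…[END] format described above;
# the real POLLUX_TAGGED_SCORE_RE / POLLUX_TAGGED_FEEDBACK_RE may differ.
TAGGED_SCORE_RE = re.compile(r"\[RESULT\]\s*(\d+)\s*\[END\]", re.DOTALL)
TAGGED_FEEDBACK_RE = re.compile(r"\[FEEDBACK\](.*?)\[RESULT\]", re.DOTALL)

def make_score_parser(pattern):
    """Build a parser that pulls an integer score out of raw judge output."""
    def parse(judge_output: str):
        match = pattern.search(judge_output)
        return int(match.group(1)) if match else None
    return parse

def make_feedback_parser(pattern):
    """Build a parser that pulls free-text feedback, or '' when no pattern/match."""
    def parse(judge_output: str):
        if pattern is None:
            return ""
        match = pattern.search(judge_output)
        return match.group(1).strip() if match else ""
    return parse

parse_score = make_score_parser(TAGGED_SCORE_RE)
parse_feedback = make_feedback_parser(TAGGED_FEEDBACK_RE)

raw = "[FEEDBACK] Fluent, but misses one fact. [RESULT] 4 [END]"
print(parse_score(raw))     # 4
print(parse_feedback(raw))  # Fluent, but misses one fact.
```

The factory shape matters because the same compiled class can then accept arbitrary per-run patterns at initialization time, matching the customization story above.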

Why this metric is not on the Metrics enum

  • Metrics lists preset metrics with fixed wiring (metric_name, corpus_level_fn, …).
  • POLLUX requires per-run configuration: criterion text, rubric map, judge model, backend, URL, optional tagged patterns, etc.; there is no single default that fits all benchmarks.
  • Public API: wrap with SampleLevelMetric(..., sample_level_fn=PolluxLLMJudgeMetric(...), batched_compute=True, ...), like other custom metrics. Adding an enum member without an agreed preset would bloat the API.

(If a canonical preset appears later, we could add an enum entry and, if needed, list it in SKIPPED_METRICS for automated JSON — same idea as simpleqa_judge.)
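
As a concrete illustration of the corpus-level behavior noted in the summary (average only pollux_score, never the free-text feedback), here is a hypothetical aggregator of the kind one might pass as corpus_level_fn. The helper name and the None-skipping behavior are assumptions, not the PR's actual implementation.

```python
import statistics

def aggregate_pollux_scores(sample_outputs):
    """Hypothetical corpus-level aggregator: mean of pollux_score across
    samples, skipping unparseable (None) scores and ignoring pollux_feedback."""
    scores = [s["pollux_score"] for s in sample_outputs
              if s.get("pollux_score") is not None]
    return statistics.mean(scores) if scores else 0.0

outputs = [
    {"pollux_score": 4, "pollux_feedback": "Concise."},
    {"pollux_score": 2, "pollux_feedback": ""},
    {"pollux_score": None, "pollux_feedback": "unparseable judge output"},
]
print(aggregate_pollux_scores(outputs))  # 3.0
```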


@ulyanaisaeva ulyanaisaeva marked this pull request as ready for review April 13, 2026 08:31