
multi-agent v2: Agent, TaskSet, Registry abstractions#961

Draft
Bhoy1 wants to merge 135 commits into PrimeIntellect-ai:main from Bhoy1:Bhoy1/Multiagent_v2

Conversation

@Bhoy1 Bhoy1 commented Feb 26, 2026

Summary

  • Rebuilt multi-agent support with clean separation of concerns
  • Agent: WHO responds — model, client, system_prompt, sampling config
  • TaskSet: WHAT to solve — game rules, prompts, scoring, reusable across agents
  • Registry: HOW to wire — connects agents to envs, enables Registry.spawn() for parallel child rollouts
  • MultiAgentEnv: engine that runs the game loop, supports TaskSet mode and subclass mode
  • Agent system_prompt auto-prepended to prompts built by TaskSet
  • ActorAgent, ProtocolRegistry
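
The separation of concerns above can be sketched in a few lines. This is a hypothetical illustration based only on the summary bullets; class and method names (`build_prompt`, `register`, the `sampling` field) are assumptions, not the PR's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """WHO responds: model, system prompt, sampling config."""
    model: str
    system_prompt: str = ""
    sampling: dict = field(default_factory=dict)

@dataclass
class TaskSet:
    """WHAT to solve: prompts and scoring, reusable across agents."""
    name: str

    def build_prompt(self, agent: Agent, observation: str) -> list[dict]:
        msgs = []
        if agent.system_prompt:  # agent system_prompt auto-prepended
            msgs.append({"role": "system", "content": agent.system_prompt})
        msgs.append({"role": "user", "content": observation})
        return msgs

class Registry:
    """HOW to wire: maps actor ids to agents."""
    def __init__(self):
        self.agents: dict[str, Agent] = {}

    def register(self, actor_id: str, agent: Agent) -> None:
        self.agents[actor_id] = agent

reg = Registry()
reg.register("guesser", Agent(model="qwen3-4b",
                              system_prompt="You ask yes/no questions."))
task = TaskSet(name="twenty_questions")
prompt = task.build_prompt(reg.agents["guesser"], "Round 1: ask your question.")
```

The same `TaskSet` instance can be reused with any registered `Agent`, which is the point of the split.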

Environments on new pattern:

  • twenty_questions, poker, rock_paper_scissors, proposer_solver (TaskSet)
  • poker_multi (TaskSet, 2-9 players)
  • proposer_code_solver (TaskSet + tool execution, new)
  • arc_multiagent (subclass mode with Registry.spawn(), hierarchical pipeline)

Test plan

  • poker_multi: 6-player games with per-agent models, JSON actions, rewards
  • twenty_questions: alternating turns, win detection, efficiency scoring
  • proposer_code_solver: tool execution with code REPL
  • arc_multiagent: hierarchical spawning pipeline
  • Training loop integration (GRPO with per-actor states)

Note

Medium Risk
Large new environments add complex orchestration (subprocess sandboxing, parallel spawning, multimodal prompting) and game logic, which may introduce runtime/platform issues, but they are additive with minimal impact on existing core behavior.

Overview
Adds a new docs/multiagent.md guide describing multi-agent environment patterns (turn-taking, simultaneous moves, per-actor rewards/GRPO, and hierarchical spawning).

Introduces a full ARC-AGI multi-agent solver environment (environments/arc_multiagent) that orchestrates parallel solver strategies (shallow/deep, codegen with sandboxed execution, vision, objects/hints) via Registry.spawn(), then selects final answers via a duo-pick judge council with logic/consistency judging fallback; includes packaging metadata.

Adds new poker environments: environments/poker (heads-up) and environments/poker_multi (2–9 players) implemented in the new TaskSet composition style with per-player agents, JSON action parsing, and chip-based per-actor rewards, along with their pyproject.toml configs.

Written by Cursor Bugbot for commit 6821879.

wjh581 and others added 25 commits January 25, 2026 05:37
…lay different models per actor. Should show in logs in CLI display. Example in twenty question/poker.
… Also, can setup currently to have different models with different system prompts play each other. Still validating environments
…up signatures, fix eval_utils KeyError for child states
Replace Actor/Protocol with Agent/TaskSet/Registry pattern.
Agent owns model config + system prompt, TaskSet owns game logic,
Registry wires agents to envs and enables parallel spawn.
@Bhoy1 Bhoy1 marked this pull request as draft February 26, 2026 02:22

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 5 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable autofix in the Cursor dashboard.


for i, (inp, out) in enumerate(grids):
    ax_in = axes[i][0] if num_pairs > 1 else axes[0]
    ax_out = axes[i][1] if num_pairs > 1 else axes[1]

Image generation crashes when only one training pair

High Severity

When num_pairs == 1, plt.subplots(1, 2) returns a 1D array [ax1, ax2]. After wrapping with axes = [axes], the list becomes [[ax1, ax2]]. The else branch then accesses axes[0] (which gives the inner array, not a single axis) and axes[1] (which raises an IndexError since the outer list has only one element). The condition incorrectly bypasses the wrapped indexing axes[i][0] / axes[i][1] which would work correctly after the wrapping.
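
One standard way to avoid this shape ambiguity entirely is `squeeze=False`, which makes `plt.subplots` always return a 2D `(nrows, ncols)` axes array, so `axes[i][0]` / `axes[i][1]` work uniformly. A minimal sketch (not the PR's code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for illustration only
import matplotlib.pyplot as plt

num_pairs = 1
# squeeze=False guarantees axes has shape (num_pairs, 2) even when num_pairs == 1,
# so no manual wrapping or special-case indexing is needed.
fig, axes = plt.subplots(num_pairs, 2, squeeze=False)
for i in range(num_pairs):
    ax_in, ax_out = axes[i][0], axes[i][1]
```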


Comment thread environments/poker/poker.py Outdated
async def should_stop(self, state: State) -> bool:
    extras = state.get("extras", {})
    if len(extras.get("folded", [])) > 0:
        return True

Fold immediately ends multi-hand poker session

High Severity

The should_stop check if len(extras.get("folded", [])) > 0: return True immediately ends the entire game session on any fold, even though num_hands defaults to 5. Since folds are extremely common in heads-up poker, most games will end after a single hand without ever reaching _resolve_showdown, which is the only code path that starts new hands. The four remaining hands are silently skipped.
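
A sketch of the intended behavior: a fold should end only the current hand, with the session continuing until `num_hands` hands have been played. Helper and field names (`on_fold`, `_start_hand`, `hands_played`) are assumptions based on the review, not the PR's code.

```python
class PokerSession:
    def __init__(self, num_hands: int = 5):
        self.num_hands = num_hands

    def on_fold(self, state: dict) -> bool:
        """Return True only when the whole session should stop."""
        extras = state["extras"]
        extras["hands_played"] += 1          # the folded hand is finished
        if extras["hands_played"] < self.num_hands:
            self._start_hand(extras)         # rotate dealer, deal next hand
            return False                     # session continues
        return True                          # all hands played

    def _start_hand(self, extras: dict) -> None:
        extras["dealer"] = "player2" if extras["dealer"] == "player1" else "player1"
        extras["folded"] = []

session = PokerSession(num_hands=3)
state = {"extras": {"hands_played": 0, "dealer": "player1", "folded": ["player2"]}}
print(session.on_fold(state))  # False: hand 1 done, hand 2 starts
print(session.on_fold(state))  # False
print(session.on_fold(state))  # True: 3 hands played, session ends
```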


    if e.get("actions_hand", 0) >= self.max_actions:
        self._resolve_showdown(state)
        return True
    return False

Fold path skips new hand in multi-hand poker

Low Severity

When all but one player fold, should_stop awards the pot and sets phase = "complete" without attempting to start a new hand, unlike _resolve_showdown which checks hands_played < num_hands and calls _start_hand. With num_hands > 1, a hand ending via fold terminates the entire session instead of rotating the dealer and starting the next hand.


Comment thread environments/poker/poker.py Outdated
dealer = extras["dealer"]
non_dealer = "player2" if dealer == "player1" else "player1"
if non_dealer not in extras["folded"]:
    extras["current_actor_id"] = non_dealer

Postflop turn order reversed by _advance_phase override

Medium Severity

_advance_phase sets extras["current_actor_id"] = non_dealer intending for the non-dealer to act first postflop. But the framework's get_next_role returns the opposite of current_actor_id, so this causes the dealer to act first postflop instead. In heads-up poker, the non-dealer acts first on flop/turn/river. When the dealer was the last preflop actor (e.g., dealer called a BB raise), the natural turn order would have been correct, but this override actively breaks it.
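
The inversion is easy to see in isolation. Assuming `get_next_role` behaves as the review describes (returning the opposite of `current_actor_id`; the implementation below is ours), the override must set `current_actor_id` to the *dealer* for the non-dealer to act first:

```python
def get_next_role(current_actor_id: str) -> str:
    # Assumed framework behavior: the next actor is the opposite of current.
    return "player2" if current_actor_id == "player1" else "player1"

dealer, non_dealer = "player1", "player2"

# Buggy override: setting current_actor_id to the non-dealer hands the
# next turn to the dealer.
assert get_next_role(non_dealer) == dealer

# Intended postflop order: set current_actor_id to the dealer so the
# non-dealer acts first on flop/turn/river.
assert get_next_role(dealer) == non_dealer
```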


rollouts_per_example = 1
num_players = 6
num_hands = 3
max_actions_per_hand = 50

Config parameter name mismatches load_environment signature

Medium Severity

The pyproject.toml specifies max_actions_per_hand = 50 under [tool.verifiers.eval], but load_environment expects the parameter max_actions, not max_actions_per_hand. This mismatch means the config value either causes a TypeError (unexpected keyword argument) or is silently ignored, with the function falling back to its default of 50. Users adjusting this config setting would see no effect.
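
One way to turn this silent mismatch into a loud error is to reject unknown config keys at load time. A minimal sketch; the signature and defaults are inferred from the review, not taken from the PR:

```python
def load_environment(num_players=2, num_hands=5, max_actions=50, **extra):
    # Fail loudly instead of silently ignoring misspelled config keys.
    if extra:
        raise TypeError(f"unknown config keys: {sorted(extra)}")
    return {"num_players": num_players, "num_hands": num_hands,
            "max_actions": max_actions}

load_environment(num_players=6, num_hands=3, max_actions=50)  # ok
try:
    load_environment(num_players=6, max_actions_per_hand=50)  # key mismatch
except TypeError as e:
    print(e)  # unknown config keys: ['max_actions_per_hand']
```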

Additional Locations (1)


Override run_group() in MultiAgentEnv to split game states into
per-actor states before scoring. Each trainable actor gets its own
RolloutOutput with correct reward. Non-trainable actors excluded.
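
The per-actor split described in that commit might look roughly like the following. This is a hypothetical sketch; the field names and the dict-based state are assumptions, not the PR's `RolloutOutput` API.

```python
def split_per_actor(game_state: dict) -> list[dict]:
    """Split one shared game state into per-actor outputs for scoring."""
    outputs = []
    for actor_id, info in game_state["actors"].items():
        if not info.get("trainable", True):
            continue  # non-trainable actors are excluded from scoring
        outputs.append({
            "actor_id": actor_id,
            "messages": info["messages"],
            "reward": game_state["rewards"][actor_id],  # per-actor reward
        })
    return outputs

game = {
    "actors": {
        "player1": {"messages": ["raise"], "trainable": True},
        "player2": {"messages": ["fold"], "trainable": False},
    },
    "rewards": {"player1": 1.5, "player2": -1.5},
}
outs = split_per_actor(game)
```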
wjh581 and others added 30 commits March 18, 2026 19:59
13 Level 1 generators (color replacement, rotation, reflection, scaling,
gravity, flood fill, boolean ops, duplication, hollow, histogram,
dimension patterns, connectivity filter, bounding rect fill).
Dataset pushed to bhoy/arc-synthetic (260 tasks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3 actors: codegen_v1b + codegen_v4 (Qwen3-4B) + image (Qwen3-VL-4B).
Renders grids as colored PNG via matplotlib for vision model.
Same level/task_type filtering as non-vision variant.
Categories: symmetry completion, border drawing, template cloning,
shape stamping, object extraction, line extension, denoising,
interior fill, crosshair generation, object compaction, fill-level
encoding, expanding diamond, pair bridging, seam tiling, value
expansion, gap-and-beam, missing region reconstruction, noisy grid
regularization. Dataset now has 620 tasks (260 L1 + 360 L2).
Generator LoRA receives a base_op and outputs a JSON recipe choosing
difficulty settings (level, post_ops, seed). Solver LoRA attempts the
generated task. GRPO's 8 rollouts per example explore the difficulty
space within each operation family.
Split BASE_OPS into L1_OPS (13) and L2_OPS (18) with configurable
levels param. Remove post_ops from generator prompt — generator
only picks level and seed for now.
Closer to original training ranges. Also add levels param
to load_environment for L1/L2 op selection.
Tracks per-op solver reward in sliding window. Ops below skip_threshold
are temporarily skipped. Periodic probes re-test dropped ops.
Original arc_curriculum env unchanged.
Generator was copying seed:77 level:2 from prompt examples for all 12,800 rollouts.
Now shows only the JSON schema with no concrete values to anchor on.
- Removed seed from generator output (LLM just picks cliche numbers)
- Generator temp bumped to 1.2 for exploration across 3 levels
- Fixed seed=1337 so GRPO compares level choices directly
- Parser accepts bare integer fallback for level
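
A lenient parser with a bare-integer fallback, as described in the last bullet, could look like this. The behavior is inferred from the commit message; the function name and default are ours.

```python
import json
import re

def parse_level(text: str, default: int = 1) -> int:
    """Extract a level from a JSON recipe, falling back to a bare integer."""
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "level" in obj:
            return int(obj["level"])
        if isinstance(obj, int):
            return obj          # model emitted a bare integer
    except (json.JSONDecodeError, ValueError):
        pass
    m = re.search(r"\d+", text)  # last resort: first digit run in the text
    return int(m.group()) if m else default

print(parse_level('{"level": 3}'))  # 3
print(parse_level("2"))             # 2
print(parse_level("level: 2"))      # 2
```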
Model has strong prior that 2=medium=safe, causing 78% collapse to level 2.
A/B/C labels give no semantic hint about difficulty ordering, forcing the
model to learn which setting works through reward signal alone.
- Grid sizes match fixed L1 dataset: A=5-8, B=8-12, C=12-16
- 8 seeds per op (104 dataset entries) for task variety
- Mastery tracking: rolling window per level, generator gets 0 reward
  when solver avg exceeds threshold (default 0.7)
- Forces generator A→B→C as solver improves
Mastery is now tracked independently for each (operation, level) pair
instead of globally per level. This prevents easy ops from inflating
the average and prematurely triggering mastery for harder ops.
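
Per-(operation, level) mastery tracking with a rolling window, as the two commits above describe, can be sketched as follows. Window size and threshold default are assumptions (the commit names 0.7 as the default threshold).

```python
from collections import deque

class MasteryTracker:
    def __init__(self, window: int = 20, threshold: float = 0.7):
        self.window = window
        self.threshold = threshold
        # One rolling window per (operation, level) pair, not per level.
        self.history: dict[tuple[str, str], deque] = {}

    def record(self, op: str, level: str, solver_reward: float) -> None:
        key = (op, level)
        self.history.setdefault(key, deque(maxlen=self.window)).append(solver_reward)

    def generator_reward(self, op: str, level: str, base_reward: float) -> float:
        h = self.history.get((op, level), ())
        if h and sum(h) / len(h) > self.threshold:
            return 0.0  # mastered: push the generator toward harder levels
        return base_reward

t = MasteryTracker(window=3)
for r in (1.0, 1.0, 1.0):
    t.record("rotation", "A", r)
```

Keying the window on `(op, level)` is what prevents an easy op from inflating a level-wide average and prematurely flagging mastery for harder ops.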
Seeds now use 10000*level + cat_idx*1000 + task_idx (0-19),
identical to bhoy/arc-synthetic. L1 = 13 ops × 20 seeds = 260 entries.
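
The seed scheme from that commit can be verified in a few lines (the helper name `make_seed` is ours; the formula and counts come from the commit message):

```python
def make_seed(level: int, cat_idx: int, task_idx: int) -> int:
    # 10000*level + cat_idx*1000 + task_idx, with task_idx in 0-19
    assert 0 <= task_idx <= 19
    return 10000 * level + cat_idx * 1000 + task_idx

# L1 = 13 ops x 20 seeds = 260 entries, all distinct
seeds = [make_seed(1, c, t) for c in range(13) for t in range(20)]
```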
