
multi-agent v2: Agent, TaskSet, Registry abstractions#961

Draft
Bhoy1 wants to merge 135 commits into PrimeIntellect-ai:main from Bhoy1:Bhoy1/Multiagent_v2

Conversation

@Bhoy1 Bhoy1 commented Feb 26, 2026

Summary

  • Rebuilt multi-agent support with clean separation of concerns
  • Agent: WHO responds — model, client, system_prompt, sampling config
  • TaskSet: WHAT to solve — game rules, prompts, scoring, reusable across agents
  • Registry: HOW to wire — connects agents to envs, enables Registry.spawn() for parallel child rollouts
  • MultiAgentEnv: engine that runs the game loop, supports TaskSet mode and subclass mode
  • Agent system_prompt auto-prepended to prompts built by TaskSet
  • ActorAgent, ProtocolRegistry
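
The separation of concerns above can be sketched in a few lines. This is a hypothetical illustration based only on the summary bullets; class and method names (`build_prompt`, `register`, the `sampling` field) are assumptions, not the PR's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """WHO responds: model, system prompt, sampling config."""
    model: str
    system_prompt: str = ""
    sampling: dict = field(default_factory=dict)

@dataclass
class TaskSet:
    """WHAT to solve: prompts and scoring, reusable across agents."""
    name: str

    def build_prompt(self, agent: Agent, observation: str) -> list[dict]:
        msgs = []
        if agent.system_prompt:  # agent system_prompt auto-prepended
            msgs.append({"role": "system", "content": agent.system_prompt})
        msgs.append({"role": "user", "content": observation})
        return msgs

class Registry:
    """HOW to wire: maps actor ids to agents."""
    def __init__(self):
        self.agents: dict[str, Agent] = {}

    def register(self, actor_id: str, agent: Agent) -> None:
        self.agents[actor_id] = agent

reg = Registry()
reg.register("guesser", Agent(model="qwen3-4b",
                              system_prompt="You ask yes/no questions."))
task = TaskSet(name="twenty_questions")
prompt = task.build_prompt(reg.agents["guesser"], "Round 1: ask your question.")
```

The same `TaskSet` instance can be reused with any registered `Agent`, which is the point of the split.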

Environments on new pattern:

  • twenty_questions, poker, rock_paper_scissors, proposer_solver (TaskSet)
  • poker_multi (TaskSet, 2-9 players)
  • proposer_code_solver (TaskSet + tool execution, new)
  • arc_multiagent (subclass mode with Registry.spawn(), hierarchical pipeline)

Test plan

  • poker_multi: 6-player games with per-agent models, JSON actions, rewards
  • twenty_questions: alternating turns, win detection, efficiency scoring
  • proposer_code_solver: tool execution with code REPL
  • arc_multiagent: hierarchical spawning pipeline
  • Training loop integration (GRPO with per-actor states)

Note

Medium Risk
Large new environments add complex orchestration (subprocess sandboxing, parallel spawning, multimodal prompting) and game logic, which may introduce runtime/platform issues, but they are additive with minimal impact on existing core behavior.

Overview
Adds a new docs/multiagent.md guide describing multi-agent environment patterns (turn-taking, simultaneous moves, per-actor rewards/GRPO, and hierarchical spawning).

Introduces a full ARC-AGI multi-agent solver environment (environments/arc_multiagent) that orchestrates parallel solver strategies (shallow/deep, codegen with sandboxed execution, vision, objects/hints) via Registry.spawn(), then selects final answers via a duo-pick judge council with logic/consistency judging fallback; includes packaging metadata.

Adds new poker environments: environments/poker (heads-up) and environments/poker_multi (2–9 players) implemented in the new TaskSet composition style with per-player agents, JSON action parsing, and chip-based per-actor rewards, along with their pyproject.toml configs.

Written by Cursor Bugbot for commit 6821879.

wjh581 and others added 25 commits January 25, 2026 05:37
…lay different models per actor. Should show in logs in CLI display. Example in twenty question/poker.
… Also, can setup currently to have different models with different system prompts play each other. Still validating environments
…up signatures, fix eval_utils KeyError for child states
Replace Actor/Protocol with Agent/TaskSet/Registry pattern.
Agent owns model config + system prompt, TaskSet owns game logic,
Registry wires agents to envs and enables parallel spawn.
@Bhoy1 Bhoy1 marked this pull request as draft February 26, 2026 02:22

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 5 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable autofix in the Cursor dashboard.


for i, (inp, out) in enumerate(grids):
    ax_in = axes[i][0] if num_pairs > 1 else axes[0]
    ax_out = axes[i][1] if num_pairs > 1 else axes[1]

Image generation crashes when only one training pair

High Severity

When num_pairs == 1, plt.subplots(1, 2) returns a 1D array [ax1, ax2]. After wrapping with axes = [axes], the list becomes [[ax1, ax2]]. The else branch then accesses axes[0] (which gives the inner array, not a single axis) and axes[1] (which raises an IndexError since the outer list has only one element). The condition incorrectly bypasses the wrapped indexing axes[i][0] / axes[i][1] which would work correctly after the wrapping.
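
One standard way to avoid this shape ambiguity entirely is `squeeze=False`, which makes `plt.subplots` always return a 2D `(nrows, ncols)` axes array, so `axes[i][0]` / `axes[i][1]` work uniformly. A minimal sketch (not the PR's code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for illustration only
import matplotlib.pyplot as plt

num_pairs = 1
# squeeze=False guarantees axes has shape (num_pairs, 2) even when num_pairs == 1,
# so no manual wrapping or special-case indexing is needed.
fig, axes = plt.subplots(num_pairs, 2, squeeze=False)
for i in range(num_pairs):
    ax_in, ax_out = axes[i][0], axes[i][1]
```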


Comment thread environments/poker/poker.py Outdated
async def should_stop(self, state: State) -> bool:
    extras = state.get("extras", {})
    if len(extras.get("folded", [])) > 0:
        return True

Fold immediately ends multi-hand poker session

High Severity

The should_stop check if len(extras.get("folded", [])) > 0: return True immediately ends the entire game session on any fold, even though num_hands defaults to 5. Since folds are extremely common in heads-up poker, most games will end after a single hand without ever reaching _resolve_showdown, which is the only code path that starts new hands. The four remaining hands are silently skipped.
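
A sketch of the intended behavior: a fold should end only the current hand, with the session continuing until `num_hands` hands have been played. Helper and field names (`on_fold`, `_start_hand`, `hands_played`) are assumptions based on the review, not the PR's code.

```python
class PokerSession:
    def __init__(self, num_hands: int = 5):
        self.num_hands = num_hands

    def on_fold(self, state: dict) -> bool:
        """Return True only when the whole session should stop."""
        extras = state["extras"]
        extras["hands_played"] += 1          # the folded hand is finished
        if extras["hands_played"] < self.num_hands:
            self._start_hand(extras)         # rotate dealer, deal next hand
            return False                     # session continues
        return True                          # all hands played

    def _start_hand(self, extras: dict) -> None:
        extras["dealer"] = "player2" if extras["dealer"] == "player1" else "player1"
        extras["folded"] = []

session = PokerSession(num_hands=3)
state = {"extras": {"hands_played": 0, "dealer": "player1", "folded": ["player2"]}}
print(session.on_fold(state))  # False: hand 1 done, hand 2 starts
print(session.on_fold(state))  # False
print(session.on_fold(state))  # True: 3 hands played, session ends
```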


    if e.get("actions_hand", 0) >= self.max_actions:
        self._resolve_showdown(state)
        return True
    return False

Fold path skips new hand in multi-hand poker

Low Severity

When all but one player fold, should_stop awards the pot and sets phase = "complete" without attempting to start a new hand, unlike _resolve_showdown which checks hands_played < num_hands and calls _start_hand. With num_hands > 1, a hand ending via fold terminates the entire session instead of rotating the dealer and starting the next hand.


Comment thread environments/poker/poker.py Outdated
dealer = extras["dealer"]
non_dealer = "player2" if dealer == "player1" else "player1"
if non_dealer not in extras["folded"]:
    extras["current_actor_id"] = non_dealer

Postflop turn order reversed by _advance_phase override

Medium Severity

_advance_phase sets extras["current_actor_id"] = non_dealer intending for the non-dealer to act first postflop. But the framework's get_next_role returns the opposite of current_actor_id, so this causes the dealer to act first postflop instead. In heads-up poker, the non-dealer acts first on flop/turn/river. When the dealer was the last preflop actor (e.g., dealer called a BB raise), the natural turn order would have been correct, but this override actively breaks it.
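
The inversion is easy to see in isolation. Assuming `get_next_role` behaves as the review describes (returning the opposite of `current_actor_id`; the implementation below is ours), the override must set `current_actor_id` to the *dealer* for the non-dealer to act first:

```python
def get_next_role(current_actor_id: str) -> str:
    # Assumed framework behavior: the next actor is the opposite of current.
    return "player2" if current_actor_id == "player1" else "player1"

dealer, non_dealer = "player1", "player2"

# Buggy override: setting current_actor_id to the non-dealer hands the
# next turn to the dealer.
assert get_next_role(non_dealer) == dealer

# Intended postflop order: set current_actor_id to the dealer so the
# non-dealer acts first on flop/turn/river.
assert get_next_role(dealer) == non_dealer
```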


rollouts_per_example = 1
num_players = 6
num_hands = 3
max_actions_per_hand = 50

Config parameter name mismatches load_environment signature

Medium Severity

The pyproject.toml specifies max_actions_per_hand = 50 under [tool.verifiers.eval], but load_environment expects the parameter max_actions, not max_actions_per_hand. This mismatch means the config value either causes a TypeError (unexpected keyword argument) or is silently ignored, with the function falling back to its default of 50. Users adjusting this config setting would see no effect.
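
One way to turn this silent mismatch into a loud error is to reject unknown config keys at load time. A minimal sketch; the signature and defaults are inferred from the review, not taken from the PR:

```python
def load_environment(num_players=2, num_hands=5, max_actions=50, **extra):
    # Fail loudly instead of silently ignoring misspelled config keys.
    if extra:
        raise TypeError(f"unknown config keys: {sorted(extra)}")
    return {"num_players": num_players, "num_hands": num_hands,
            "max_actions": max_actions}

load_environment(num_players=6, num_hands=3, max_actions=50)  # ok
try:
    load_environment(num_players=6, max_actions_per_hand=50)  # key mismatch
except TypeError as e:
    print(e)  # unknown config keys: ['max_actions_per_hand']
```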

Additional Locations (1)


Override run_group() in MultiAgentEnv to split game states into
per-actor states before scoring. Each trainable actor gets its own
RolloutOutput with correct reward. Non-trainable actors excluded.
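
The per-actor split described in that commit might look roughly like the following. This is a hypothetical sketch; the field names and the dict-based state are assumptions, not the PR's `RolloutOutput` API.

```python
def split_per_actor(game_state: dict) -> list[dict]:
    """Split one shared game state into per-actor outputs for scoring."""
    outputs = []
    for actor_id, info in game_state["actors"].items():
        if not info.get("trainable", True):
            continue  # non-trainable actors are excluded from scoring
        outputs.append({
            "actor_id": actor_id,
            "messages": info["messages"],
            "reward": game_state["rewards"][actor_id],  # per-actor reward
        })
    return outputs

game = {
    "actors": {
        "player1": {"messages": ["raise"], "trainable": True},
        "player2": {"messages": ["fold"], "trainable": False},
    },
    "rewards": {"player1": 1.5, "player2": -1.5},
}
outs = split_per_actor(game)
```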
wjh581 and others added 30 commits March 18, 2026 19:59
13 Level 1 generators (color replacement, rotation, reflection, scaling,
gravity, flood fill, boolean ops, duplication, hollow, histogram,
dimension patterns, connectivity filter, bounding rect fill).
Dataset pushed to bhoy/arc-synthetic (260 tasks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3 actors: codegen_v1b + codegen_v4 (Qwen3-4B) + image (Qwen3-VL-4B).
Renders grids as colored PNG via matplotlib for vision model.
Same level/task_type filtering as non-vision variant.
Categories: symmetry completion, border drawing, template cloning,
shape stamping, object extraction, line extension, denoising,
interior fill, crosshair generation, object compaction, fill-level
encoding, expanding diamond, pair bridging, seam tiling, value
expansion, gap-and-beam, missing region reconstruction, noisy grid
regularization. Dataset now has 620 tasks (260 L1 + 360 L2).
Generator LoRA receives a base_op and outputs a JSON recipe choosing
difficulty settings (level, post_ops, seed). Solver LoRA attempts the
generated task. GRPO's 8 rollouts per example explore the difficulty
space within each operation family.
Split BASE_OPS into L1_OPS (13) and L2_OPS (18) with configurable
levels param. Remove post_ops from generator prompt — generator
only picks level and seed for now.
Closer to original training ranges. Also add levels param
to load_environment for L1/L2 op selection.
Tracks per-op solver reward in sliding window. Ops below skip_threshold
are temporarily skipped. Periodic probes re-test dropped ops.
Original arc_curriculum env unchanged.
Generator was copying seed:77 level:2 from prompt examples for all 12,800 rollouts.
Now shows only the JSON schema with no concrete values to anchor on.
- Removed seed from generator output (LLM just picks cliche numbers)
- Generator temp bumped to 1.2 for exploration across 3 levels
- Fixed seed=1337 so GRPO compares level choices directly
- Parser accepts bare integer fallback for level
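
A lenient parser with a bare-integer fallback, as described in the last bullet, could look like this. The behavior is inferred from the commit message; the function name and default are ours.

```python
import json
import re

def parse_level(text: str, default: int = 1) -> int:
    """Extract a level from a JSON recipe, falling back to a bare integer."""
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "level" in obj:
            return int(obj["level"])
        if isinstance(obj, int):
            return obj          # model emitted a bare integer
    except (json.JSONDecodeError, ValueError):
        pass
    m = re.search(r"\d+", text)  # last resort: first digit run in the text
    return int(m.group()) if m else default

print(parse_level('{"level": 3}'))  # 3
print(parse_level("2"))             # 2
print(parse_level("level: 2"))      # 2
```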
Model has strong prior that 2=medium=safe, causing 78% collapse to level 2.
A/B/C labels give no semantic hint about difficulty ordering, forcing the
model to learn which setting works through reward signal alone.
- Grid sizes match fixed L1 dataset: A=5-8, B=8-12, C=12-16
- 8 seeds per op (104 dataset entries) for task variety
- Mastery tracking: rolling window per level, generator gets 0 reward
  when solver avg exceeds threshold (default 0.7)
- Forces generator A→B→C as solver improves
Mastery is now tracked independently for each (operation, level) pair
instead of globally per level. This prevents easy ops from inflating
the average and prematurely triggering mastery for harder ops.
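
Per-(operation, level) mastery tracking with a rolling window, as the two commits above describe, can be sketched as follows. Window size and threshold default are assumptions (the commit names 0.7 as the default threshold).

```python
from collections import deque

class MasteryTracker:
    def __init__(self, window: int = 20, threshold: float = 0.7):
        self.window = window
        self.threshold = threshold
        # One rolling window per (operation, level) pair, not per level.
        self.history: dict[tuple[str, str], deque] = {}

    def record(self, op: str, level: str, solver_reward: float) -> None:
        key = (op, level)
        self.history.setdefault(key, deque(maxlen=self.window)).append(solver_reward)

    def generator_reward(self, op: str, level: str, base_reward: float) -> float:
        h = self.history.get((op, level), ())
        if h and sum(h) / len(h) > self.threshold:
            return 0.0  # mastered: push the generator toward harder levels
        return base_reward

t = MasteryTracker(window=3)
for r in (1.0, 1.0, 1.0):
    t.record("rotation", "A", r)
```

Keying the window on `(op, level)` is what prevents an easy op from inflating a level-wide average and prematurely flagging mastery for harder ops.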
Seeds now use 10000*level + cat_idx*1000 + task_idx (0-19),
identical to bhoy/arc-synthetic. L1 = 13 ops × 20 seeds = 260 entries.
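
The seed scheme from that commit can be verified in a few lines (the helper name `make_seed` is ours; the formula and counts come from the commit message):

```python
def make_seed(level: int, cat_idx: int, task_idx: int) -> int:
    # 10000*level + cat_idx*1000 + task_idx, with task_idx in 0-19
    assert 0 <= task_idx <= 19
    return 10000 * level + cat_idx * 1000 + task_idx

# L1 = 13 ops x 20 seeds = 260 entries, all distinct
seeds = [make_seed(1, c, t) for c in range(13) for t in range(20)]
```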
