support deterministic RL training with sglang #1644

Open
YanhuiDua wants to merge 2 commits into InternLM:main from YanhuiDua:determ_infer

Conversation

@YanhuiDua
Collaborator

@YanhuiDua YanhuiDua commented Mar 30, 2026

  • add MoE deterministic RL training

@YanhuiDua
Collaborator Author

@claude review

data_path = os.environ["DATA_PATH"]
eval_data_path = os.environ["EVAL_DATA_PATH"]
enable_evaluate = True if eval_data_path != "" else False
enbale_partial_rollout = int(os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0"))
Contributor

Claude: Typo: enbale_partial_rollout should be enable_partial_rollout. The env var name ENBALE_PARTIAL_ROLLOUT has the same typo.

I see this typo is inherited from other config files (rl_qwen3_8B_grpo.py, rl_qwen25_7B_dapo.py), so it's a pre-existing issue. However, since this is a rename/cleanup of this config file, it would be a good opportunity to fix it here and in the env var name.
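A corrected version of the snippet above might look like this (a sketch; the `.get` defaults and the fallback to the old misspelled env var name are my own suggestions, not part of the PR):

```python
import os

# Defaults added so this fragment runs standalone; the real config reads
# DATA_PATH unconditionally.
data_path = os.environ.get("DATA_PATH", "")
eval_data_path = os.environ.get("EVAL_DATA_PATH", "")
enable_evaluate = eval_data_path != ""
# Spelling fixed in both the Python name and the env var; the misspelled
# env var is still read as a fallback so existing launch scripts keep working.
enable_partial_rollout = int(
    os.environ.get(
        "ENABLE_PARTIAL_ROLLOUT",
        os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0"),
    )
)
```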

stops: Annotated[list[str], Parameter(help="List of stop sequences.")] = []
stop_token_ids: Annotated[list[int], Parameter(help="List of stop token IDs.")] = []
skip_special_tokens: Annotated[bool, Parameter(help="Whether to skip special tokens.")] = True
sampling_seed: Annotated[int, Parameter(help="The seed for random number generator in sampling.")] = 0
Contributor

Claude: A default sampling_seed of 0 can be ambiguous — some backends treat 0 as "use random seed" while others treat it as an actual seed value. Consider using None as the default with int | None type, and only pass it to the backend when XTUNER_DETERMINISTIC is enabled. This would make the non-deterministic path clearer and avoid accidentally seeding when determinism isn't intended.
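The suggestion could look like this (a sketch: `Parameter` is stubbed here, and the `build_backend_kwargs` helper is hypothetical, not an xtuner API):

```python
from typing import Annotated, Optional


class Parameter:
    """Stub for the CLI annotation class used in the real config."""

    def __init__(self, help: str = ""):
        self.help = help


# None now unambiguously means "no seed requested".
sampling_seed: Annotated[
    Optional[int], Parameter(help="Seed for sampling; None uses the backend default.")
] = None


def build_backend_kwargs(seed: Optional[int], deterministic: bool) -> dict:
    """Forward a seed to the backend only when determinism is requested."""
    if deterministic and seed is not None:
        return {"sampling_seed": seed}
    return {}
```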

  if self.rollout_controller:
      response_future = []
-     for sample in group_data_items:
+     for i, sample in enumerate(group_data_items):
Contributor

Claude: copy.deepcopy(sample_params) is now called unconditionally for every sample in the loop, whereas before it was only called in the partial rollout branch. This is a performance regression for the non-deterministic path — deep-copying a Pydantic BaseModel on every iteration is unnecessary overhead when the params aren't being mutated.

Consider guarding the deepcopy:

Suggested change
for i, sample in enumerate(group_data_items):
update_sample_params = copy.deepcopy(sample_params) if XTUNER_DETERMINISTIC else sample_params
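Expanded into a loop, the guard might look like this (a sketch; the per-sample seed derivation and function name are illustrative, not the PR's actual code):

```python
import copy
import os

XTUNER_DETERMINISTIC = os.environ.get("XTUNER_DETERMINISTIC", "0") == "1"


def iter_sample_params(group_data_items, sample_params):
    """Yield (sample, params), copying params only when they will be mutated.

    The deterministic path mutates params (per-sample seed), so it needs a
    private copy; the default path shares one object and skips the deepcopy.
    """
    for i, sample in enumerate(group_data_items):
        if XTUNER_DETERMINISTIC:
            params = copy.deepcopy(sample_params)  # safe to mutate
            params["sampling_seed"] = i            # illustrative seed rule
        else:
            params = sample_params                 # shared, never mutated
        yield sample, params
```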

- for sample in group_data_items:
+ for i, sample in enumerate(group_data_items):
      sample.data.extra_info["root_id"] = sample.uid.root_id
      sample.data.extra_info["action_id"] = sample.uid.action_id
Contributor

Claude: self.rollout_cfg is accessed here to get random_seed, but rollout_cfg can be None (see the __init__ signature at line 44: rollout_cfg=None). If XTUNER_DETERMINISTIC is True and rollout_cfg is None, this will raise AttributeError. Add a guard or assert that rollout_cfg is not None when deterministic mode is enabled.


  self.server_func = launch_server
- self.endpoints["health_generate"] = "health_generate"
+ self.endpoints["health_generate"] = "health"
Contributor

Claude: Changing the health endpoint from "health_generate" to "health" is a behavioral change that isn't related to deterministic inference. The "health_generate" endpoint checks if the server can actually handle generate requests, while "health" may only confirm the HTTP server is up. This could mask readiness issues where the server is alive but not ready to serve inference.

Was this change intentional? If so, please document the reason (e.g., SGLang version compatibility).

sglang_server_args.enable_deterministic_inference = enable_deterministic_inference

if XTUNER_DETERMINISTIC:
sglang_server_args.enable_deterministic_inference = True
Contributor

Claude: rl_on_policy_target is not a recognized attribute of SGLang's ServerArgs as of the latest public SGLang release. This will raise an AttributeError (or be silently ignored if SGLang uses a permissive config model) on standard SGLang installations.

Is this targeting a custom/internal fork of SGLang? If so, please add a comment noting the required SGLang version or fork, so users don't hit runtime errors with standard SGLang.

Comment on lines +648 to +649
train_log_info = {}
eval_log_info = {}
Contributor

Claude: Initializing train_log_info = {} and eval_log_info = {} before the if not self._debug_rollout block fixes a potential UnboundLocalError when _debug_rollout is True. Good fix — but this is a pre-existing bug fix unrelated to deterministic training. Per the project's PR standards, consider splitting this into a separate [Fix] commit/PR to keep this PR focused on the deterministic feature.
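The bug pattern being fixed, reduced to a minimal example (function and field names are illustrative):

```python
def log_step(debug_rollout: bool) -> tuple:
    """Minimal reproduction of the pre-existing UnboundLocalError.

    Without the two unconditional initializations below, returning the
    dicts raises UnboundLocalError whenever debug_rollout is True,
    because the assignments inside the if-branch never run.
    """
    train_log_info = {}  # the fix: initialize unconditionally
    eval_log_info = {}
    if not debug_rollout:
        train_log_info = {"loss": 0.0}
        eval_log_info = {"acc": 0.0}
    return train_log_info, eval_log_info
```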

Comment on lines +534 to +535

def _sort_rollout_outputs(self, data_groups, multimodal_train_infos=None):
Contributor

Claude: The _sort_rollout_outputs method lacks a docstring (required for public methods per project standards). Also, the multimodal_train_infos parameter should have a type annotation:

def _sort_rollout_outputs(self, data_groups: list[list[RLDataFlowItem]], multimodal_train_infos: list | None = None):

@@ -0,0 +1,117 @@
set -ex
Contributor

Claude: Missing #!/bin/bash shebang on line 1. The script starts with set -ex directly. Also, missing a newline at the end of the file (line 117).
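A corrected script header might look like this (a sketch):

```shell
#!/bin/bash
# Shebang first so the script is not interpreted by whatever shell invokes it,
# then the existing trace/exit-on-error flags.
set -ex
```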

Comment on lines +3 to +7
# examples of usage:
# qwen3_8B_grpo_gsm8k training:
# bash examples/v1/scripts/run_rl.sh examples/v1/config/rl_qwen3_8B_grpo.py "sglang" $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
# qwen2.5_7B_dapo_math training:
# bash examples/v1/scripts/run_rl.sh examples/v1/config/rl_qwen25_7B_dapo.py "sglang" $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
Contributor

Claude: The usage comments reference run_rl.sh but this script is run_rl_deterministic.sh. Also, the usage shows 5 positional args (CONFIG_PATH, "sglang", MODEL_PATH, DATA_PATH, EVAL_DATA_PATH) but the script only accepts 4 (CONFIG_PATH, MODEL_PATH, DATA_PATH, EVAL_DATA_PATH) — it doesn't take an inference backend argument. These comments appear to be copy-pasted from run_rl.sh without updating.
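Corrected usage comments, matching the script's actual name and its four positional arguments as described in the review (a sketch):

```shell
# examples of usage:
# qwen3_8B_grpo_gsm8k training:
# bash examples/v1/scripts/run_rl_deterministic.sh examples/v1/config/rl_qwen3_8B_grpo.py $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
# qwen2.5_7B_dapo_math training:
# bash examples/v1/scripts/run_rl_deterministic.sh examples/v1/config/rl_qwen25_7B_dapo.py $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
```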

export WORK_DIR="${DIR}/work_dirs/${model_dir_name}_${data_dir_name}_${infer_backend_lower}"
else
export WORK_DIR=$WORK_DIR
fi
Contributor

Claude: infer_backend_lower is used here but never defined in this script. This will produce a work directory path ending with an empty string (e.g., work_dirs/model_data_). Looks like it was copied from run_rl.sh where the inference backend variable exists.
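A minimal fix defines the backend before it is used (a sketch; since the review notes this script is SGLang-only, hardcoding the backend is one option, and the dir names here are placeholders):

```shell
#!/bin/bash
set -eu
# Define the backend this script supports and derive the lowercase form
# that the WORK_DIR path expects.
INFER_BACKEND="sglang"
infer_backend_lower=$(printf '%s' "$INFER_BACKEND" | tr '[:upper:]' '[:lower:]')
model_dir_name="model"   # placeholder for illustration
data_dir_name="data"     # placeholder for illustration
DIR="${DIR:-$PWD}"
default_work_dir="${DIR}/work_dirs/${model_dir_name}_${data_dir_name}_${infer_backend_lower}"
# Respect a caller-provided WORK_DIR, otherwise fall back to the default.
export WORK_DIR="${WORK_DIR:-$default_work_dir}"
echo "$WORK_DIR"
```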

@claude
Contributor

claude bot commented Mar 31, 2026

Claude: ## Summary

This PR adds deterministic RL training support with SGLang by introducing seed propagation, deterministic ID generation, and output sorting. It also renames the test config to fix a typo (gsk8k → gsm8k) and includes a new launcher script.

Issues

Critical

  • [single_turn_env.py:99] self.rollout_cfg.random_seed will raise AttributeError when rollout_cfg is None (which is a valid state per the constructor signature).
  • [sglang.py:183] rl_on_policy_target is not a recognized SGLang ServerArgs attribute in public releases — this will fail on standard SGLang installations.

Warning

  • [single_turn_env.py:97] copy.deepcopy(sample_params) is now called unconditionally for every sample, adding overhead to the non-deterministic path. Should be guarded by XTUNER_DETERMINISTIC.
  • [sglang.py:28] Health endpoint changed from health_generate to health — this weakens readiness checking and is unrelated to deterministic inference.
  • [rl_data.py:432] sampling_seed defaults to 0, which may be ambiguous (random vs actual seed) depending on backend. Consider int | None with None default.
  • [run_rl_deterministic.sh:56] infer_backend_lower variable is undefined — work dir path will be malformed.
  • [run_rl_deterministic.sh:3-7] Usage comments are copy-pasted from run_rl.sh with wrong arg count and script name.

Nit

  • [rl_trainer.py:534-535] _sort_rollout_outputs is a public method missing docstring and type annotations.
  • [rl_trainer.py:648-649] train_log_info/eval_log_info initialization fixes a pre-existing UnboundLocalError — consider splitting into a separate [Fix] PR.
  • [run_rl_deterministic.sh:1] Missing #!/bin/bash shebang and trailing newline.
  • [rl_qwen3_gsm8k_grpo.py:31] Pre-existing typo enbale_partial_rollout propagated into this config.
  • No tests are included for the new deterministic sorting/seeding logic.

Verdict

REQUEST_CHANGES
