support deterministic RL training with sglang #1644

Open
YanhuiDua wants to merge 2 commits into InternLM:main from YanhuiDua:determ_infer

Conversation

@YanhuiDua
Collaborator

@YanhuiDua YanhuiDua commented Mar 30, 2026

  • add MoE deterministic RL training

@YanhuiDua
Collaborator Author

@claude review

data_path = os.environ["DATA_PATH"]
eval_data_path = os.environ["EVAL_DATA_PATH"]
enable_evaluate = True if eval_data_path != "" else False
enbale_partial_rollout = int(os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0"))
Contributor

Claude: Typo: enbale_partial_rollout should be enable_partial_rollout. The env var name ENBALE_PARTIAL_ROLLOUT has the same typo.

I see this typo is inherited from other config files (rl_qwen3_8B_grpo.py, rl_qwen25_7B_dapo.py), so it's a pre-existing issue. However, since this is a rename/cleanup of this config file, it would be a good opportunity to fix it here and in the env var name.
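A corrected version of the snippet above might look like this (a sketch; the `.get` defaults and the fallback to the old misspelled env var name are my own suggestions, not part of the PR):

```python
import os

# Defaults added so this fragment runs standalone; the real config reads
# DATA_PATH unconditionally.
data_path = os.environ.get("DATA_PATH", "")
eval_data_path = os.environ.get("EVAL_DATA_PATH", "")
enable_evaluate = eval_data_path != ""
# Spelling fixed in both the Python name and the env var; the misspelled
# env var is still read as a fallback so existing launch scripts keep working.
enable_partial_rollout = int(
    os.environ.get(
        "ENABLE_PARTIAL_ROLLOUT",
        os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0"),
    )
)
```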

stops: Annotated[list[str], Parameter(help="List of stop sequences.")] = []
stop_token_ids: Annotated[list[int], Parameter(help="List of stop token IDs.")] = []
skip_special_tokens: Annotated[bool, Parameter(help="Whether to skip special tokens.")] = True
sampling_seed: Annotated[int, Parameter(help="The seed for random number generator in sampling.")] = 0
Contributor

Claude: A default sampling_seed of 0 can be ambiguous — some backends treat 0 as "use random seed" while others treat it as an actual seed value. Consider using None as the default with int | None type, and only pass it to the backend when XTUNER_DETERMINISTIC is enabled. This would make the non-deterministic path clearer and avoid accidentally seeding when determinism isn't intended.
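The suggestion could look like this (a sketch: `Parameter` is stubbed here, and the `build_backend_kwargs` helper is hypothetical, not an xtuner API):

```python
from typing import Annotated, Optional


class Parameter:
    """Stub for the CLI annotation class used in the real config."""

    def __init__(self, help: str = ""):
        self.help = help


# None now unambiguously means "no seed requested".
sampling_seed: Annotated[
    Optional[int], Parameter(help="Seed for sampling; None uses the backend default.")
] = None


def build_backend_kwargs(seed: Optional[int], deterministic: bool) -> dict:
    """Forward a seed to the backend only when determinism is requested."""
    if deterministic and seed is not None:
        return {"sampling_seed": seed}
    return {}
```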

  if self.rollout_controller:
      response_future = []
-     for sample in group_data_items:
+     for i, sample in enumerate(group_data_items):
Contributor

Claude: copy.deepcopy(sample_params) is now called unconditionally for every sample in the loop, whereas before it was only called in the partial rollout branch. This is a performance regression for the non-deterministic path — deep-copying a Pydantic BaseModel on every iteration is unnecessary overhead when the params aren't being mutated.

Consider guarding the deepcopy:

Suggested change
for i, sample in enumerate(group_data_items):
update_sample_params = copy.deepcopy(sample_params) if XTUNER_DETERMINISTIC else sample_params
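Expanded into a loop, the guard might look like this (a sketch; the per-sample seed derivation and function name are illustrative, not the PR's actual code):

```python
import copy
import os

XTUNER_DETERMINISTIC = os.environ.get("XTUNER_DETERMINISTIC", "0") == "1"


def iter_sample_params(group_data_items, sample_params):
    """Yield (sample, params), copying params only when they will be mutated.

    The deterministic path mutates params (per-sample seed), so it needs a
    private copy; the default path shares one object and skips the deepcopy.
    """
    for i, sample in enumerate(group_data_items):
        if XTUNER_DETERMINISTIC:
            params = copy.deepcopy(sample_params)  # safe to mutate
            params["sampling_seed"] = i            # illustrative seed rule
        else:
            params = sample_params                 # shared, never mutated
        yield sample, params
```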

- for sample in group_data_items:
+ for i, sample in enumerate(group_data_items):
      sample.data.extra_info["root_id"] = sample.uid.root_id
      sample.data.extra_info["action_id"] = sample.uid.action_id
Contributor

Claude: self.rollout_cfg is accessed here to get random_seed, but rollout_cfg can be None (see the __init__ signature at line 44: rollout_cfg=None). If XTUNER_DETERMINISTIC is True and rollout_cfg is None, this will raise AttributeError. Add a guard or assert that rollout_cfg is not None when deterministic mode is enabled.


  self.server_func = launch_server
- self.endpoints["health_generate"] = "health_generate"
+ self.endpoints["health_generate"] = "health"
Contributor

Claude: Changing the health endpoint from "health_generate" to "health" is a behavioral change that isn't related to deterministic inference. The "health_generate" endpoint checks if the server can actually handle generate requests, while "health" may only confirm the HTTP server is up. This could mask readiness issues where the server is alive but not ready to serve inference.

Was this change intentional? If so, please document the reason (e.g., SGLang version compatibility).

sglang_server_args.enable_deterministic_inference = enable_deterministic_inference

if XTUNER_DETERMINISTIC:
sglang_server_args.enable_deterministic_inference = True
Contributor

Claude: rl_on_policy_target is not a recognized attribute of SGLang's ServerArgs as of the latest public SGLang release. This will raise an AttributeError (or be silently ignored if SGLang uses a permissive config model) on standard SGLang installations.

Is this targeting a custom/internal fork of SGLang? If so, please add a comment noting the required SGLang version or fork, so users don't hit runtime errors with standard SGLang.

Comment on lines +648 to +649
train_log_info = {}
eval_log_info = {}
Contributor

Claude: Initializing train_log_info = {} and eval_log_info = {} before the if not self._debug_rollout block fixes a potential UnboundLocalError when _debug_rollout is True. Good fix — but this is a pre-existing bug fix unrelated to deterministic training. Per the project's PR standards, consider splitting this into a separate [Fix] commit/PR to keep this PR focused on the deterministic feature.
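The bug pattern being fixed, reduced to a minimal example (function and field names are illustrative):

```python
def log_step(debug_rollout: bool) -> tuple:
    """Minimal reproduction of the pre-existing UnboundLocalError.

    Without the two unconditional initializations below, returning the
    dicts raises UnboundLocalError whenever debug_rollout is True,
    because the assignments inside the if-branch never run.
    """
    train_log_info = {}  # the fix: initialize unconditionally
    eval_log_info = {}
    if not debug_rollout:
        train_log_info = {"loss": 0.0}
        eval_log_info = {"acc": 0.0}
    return train_log_info, eval_log_info
```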

Comment on lines +534 to +535

def _sort_rollout_outputs(self, data_groups, multimodal_train_infos=None):
Contributor

Claude: The _sort_rollout_outputs method lacks a docstring (required for public methods per project standards). Also, the multimodal_train_infos parameter should have a type annotation:

def _sort_rollout_outputs(self, data_groups: list[list[RLDataFlowItem]], multimodal_train_infos: list | None = None):

@@ -0,0 +1,117 @@
set -ex
Contributor

Claude: Missing #!/bin/bash shebang on line 1. The script starts with set -ex directly. Also, missing a newline at the end of the file (line 117).
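A corrected script header might look like this (a sketch):

```shell
#!/bin/bash
# Shebang first so the script is not interpreted by whatever shell invokes it,
# then the existing trace/exit-on-error flags.
set -ex
```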

Comment on lines +3 to +7
# examples of usage:
# qwen3_8B_grpo_gsm8k training:
# bash examples/v1/scripts/run_rl.sh examples/v1/config/rl_qwen3_8B_grpo.py "sglang" $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
# qwen2.5_7B_dapo_math training:
# bash examples/v1/scripts/run_rl.sh examples/v1/config/rl_qwen25_7B_dapo.py "sglang" $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
Contributor

Claude: The usage comments reference run_rl.sh but this script is run_rl_deterministic.sh. Also, the usage shows 5 positional args (CONFIG_PATH, "sglang", MODEL_PATH, DATA_PATH, EVAL_DATA_PATH) but the script only accepts 4 (CONFIG_PATH, MODEL_PATH, DATA_PATH, EVAL_DATA_PATH) — it doesn't take an inference backend argument. These comments appear to be copy-pasted from run_rl.sh without updating.
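Corrected usage comments, matching the script's actual name and its four positional arguments as described in the review (a sketch):

```shell
# examples of usage:
# qwen3_8B_grpo_gsm8k training:
# bash examples/v1/scripts/run_rl_deterministic.sh examples/v1/config/rl_qwen3_8B_grpo.py $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
# qwen2.5_7B_dapo_math training:
# bash examples/v1/scripts/run_rl_deterministic.sh examples/v1/config/rl_qwen25_7B_dapo.py $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH
```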

export WORK_DIR="${DIR}/work_dirs/${model_dir_name}_${data_dir_name}_${infer_backend_lower}"
else
export WORK_DIR=$WORK_DIR
fi
Contributor

Claude: infer_backend_lower is used here but never defined in this script. This will produce a work directory path ending with an empty string (e.g., work_dirs/model_data_). Looks like it was copied from run_rl.sh where the inference backend variable exists.
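A minimal fix defines the backend before it is used (a sketch; since the review notes this script is SGLang-only, hardcoding the backend is one option, and the dir names here are placeholders):

```shell
#!/bin/bash
set -eu
# Define the backend this script supports and derive the lowercase form
# that the WORK_DIR path expects.
INFER_BACKEND="sglang"
infer_backend_lower=$(printf '%s' "$INFER_BACKEND" | tr '[:upper:]' '[:lower:]')
model_dir_name="model"   # placeholder for illustration
data_dir_name="data"     # placeholder for illustration
DIR="${DIR:-$PWD}"
default_work_dir="${DIR}/work_dirs/${model_dir_name}_${data_dir_name}_${infer_backend_lower}"
# Respect a caller-provided WORK_DIR, otherwise fall back to the default.
export WORK_DIR="${WORK_DIR:-$default_work_dir}"
echo "$WORK_DIR"
```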

@claude
Contributor

claude bot commented Mar 31, 2026

Claude: ## Summary

This PR adds deterministic RL training support with SGLang by introducing seed propagation, deterministic ID generation, and output sorting. It also renames the test config to fix a typo (gsk8k → gsm8k) and includes a new launcher script.

Issues

Critical

  • [single_turn_env.py:99] self.rollout_cfg.random_seed will raise AttributeError when rollout_cfg is None (which is a valid state per the constructor signature).
  • [sglang.py:183] rl_on_policy_target is not a recognized SGLang ServerArgs attribute in public releases — this will fail on standard SGLang installations.

Warning

  • [single_turn_env.py:97] copy.deepcopy(sample_params) is now called unconditionally for every sample, adding overhead to the non-deterministic path. Should be guarded by XTUNER_DETERMINISTIC.
  • [sglang.py:28] Health endpoint changed from health_generate to health — this weakens readiness checking and is unrelated to deterministic inference.
  • [rl_data.py:432] sampling_seed defaults to 0, which may be ambiguous (random vs actual seed) depending on backend. Consider int | None with None default.
  • [run_rl_deterministic.sh:56] infer_backend_lower variable is undefined — work dir path will be malformed.
  • [run_rl_deterministic.sh:3-7] Usage comments are copy-pasted from run_rl.sh with wrong arg count and script name.

Nit

  • [rl_trainer.py:534-535] _sort_rollout_outputs is a public method missing docstring and type annotations.
  • [rl_trainer.py:648-649] train_log_info/eval_log_info initialization fixes a pre-existing UnboundLocalError — consider splitting into a separate [Fix] PR.
  • [run_rl_deterministic.sh:1] Missing #!/bin/bash shebang and trailing newline.
  • [rl_qwen3_gsm8k_grpo.py:31] Pre-existing typo enbale_partial_rollout propagated into this config.
  • No tests are included for the new deterministic sorting/seeding logic.

Verdict

REQUEST_CHANGES
