Improve reasoning and tool-call parsers #4468
Conversation
Pull request overview
This PR refactors streaming reasoning/tool-call parsing by centralizing “previous/current text + token ids” into a shared, request-attached streaming state and updates the OpenAI API server + parsers to use the new streaming interfaces.
Changes:
- Introduces `StreamingParserState` + `get_streaming_state(request)` and a shared `ThinkingReasoningParser` base for `<think>...</think>`-style models.
- Simplifies streaming parser method signatures (tool + reasoning) to accept deltas only and read previous/current context from request-attached state.
- Renames/reshuffles parser modules (`*_tool_parser.py`, `qwen_reasoning_parser.py`) and updates imports accordingly.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_qwen3coder_parser.py | Updates tool parser import path (but streaming call sites still need updating to new API). |
| tests/test_lmdeploy/test_qwen3_parser.py | Updates reasoning/tool parser import paths (but streaming call sites still need updating to new API). |
| lmdeploy/serve/openai/tool_parser/tool_parser.py | Updates abstract streaming tool parser signature + docs to rely on request-attached streaming state. |
| lmdeploy/serve/openai/tool_parser/qwen3coder_tool_parser.py | Adapts Qwen3Coder streaming parsing to read current_text via get_streaming_state. |
| lmdeploy/serve/openai/tool_parser/qwen3_tool_parser.py | Adapts Qwen3 streaming parsing to read current_text via get_streaming_state. |
| lmdeploy/serve/openai/tool_parser/qwen2d5_tool_parser.py | Adapts streaming parsing to shared state (but still uses parser-instance state). |
| lmdeploy/serve/openai/tool_parser/llama3_tool_parser.py | Adapts streaming parsing to shared state (but still uses parser-instance state). |
| lmdeploy/serve/openai/tool_parser/internlm2_tool_parser.py | Adapts streaming parsing to shared state (but still uses parser-instance state). |
| `lmdeploy/serve/openai/tool_parser/__init__.py` | Updates exports to new `*_tool_parser.py` module names. |
| lmdeploy/serve/openai/reasoning_parser/reasoning_parser.py | Adds shared streaming state + ThinkingReasoningParser base; updates reasoning parser streaming API. |
| lmdeploy/serve/openai/reasoning_parser/qwen_reasoning_parser.py | New Qwen QwQ / Intern-S1 reasoning parser based on ThinkingReasoningParser. |
| lmdeploy/serve/openai/reasoning_parser/qwen_qwq_reasoning_parser.py | Removes the prior Qwen QwQ reasoning parser implementation. |
| lmdeploy/serve/openai/reasoning_parser/deepseek_r1_reasoning_parser.py | Refactors DeepSeek-R1 reasoning parser onto ThinkingReasoningParser. |
| `lmdeploy/serve/openai/reasoning_parser/__init__.py` | Updates exports to include new shared state utilities + new module path. |
| lmdeploy/serve/openai/api_server.py | Uses StreamingParserState in the streaming loop and calls updated parser streaming APIs. |
Comments suppressed due to low confidence (3)
lmdeploy/serve/openai/tool_parser/qwen2d5_tool_parser.py:52
`Qwen2d5ToolParser` stores streaming parse state (`self.position`, `self.current_tool_id`, `self.current_tool_name_sent`, etc.) on the parser instance, but the server uses a single global parser instance across requests. This is not safe under concurrent streaming requests and can lead to mixed tool-call parsing. Please move per-request state to the request object (e.g., `request._tool_parser_state`) or another request-scoped structure.
lmdeploy/serve/openai/tool_parser/llama3_tool_parser.py:83
`Llama3JsonToolParser` relies on instance attributes (`self.current_tool_id`, `self.current_tool_name_sent`, `self.prev_tool_call_arr`, etc.) for streaming state. Since the API server stores a single global parser instance, concurrent streaming requests can race and corrupt this state. Please migrate the streaming state to be request-scoped (e.g., attach a parser-state object to `request`).
lmdeploy/serve/openai/tool_parser/internlm2_tool_parser.py:56
`Internlm2ToolParser` keeps streaming parse state (`self.position`, `self.current_tool_id`, `self.current_tool_name_sent`, etc.) on the parser instance, but `VariableInterface.tool_parser` is a single global instance shared across requests. With concurrent streaming requests, this state can interleave and corrupt parsing. Consider moving per-request state onto the request object (e.g., `request._tool_parser_state`) similar to `Qwen3ToolParser`, or into the shared `StreamingParserState`.
```python
"""Base class for reasoning parsers that use <think>...</think> style tags.

Subclasses only need to set `start_token`, `end_token`, and optionally
override `strip_newlines` and `on_missing_start_tag` to customize behavior.
```
Docstring mentions overriding `on_missing_start_tag`, but the actual configurable attribute is `on_missing_end_tag`. Please fix the docstring to avoid confusing subclass authors.
```diff
-override `strip_newlines` and `on_missing_start_tag` to customize behavior.
+override `strip_newlines` and `on_missing_end_tag` to customize behavior.
```
```diff
 from lmdeploy.serve.openai.protocol import (ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseChoice,
                                             ChatCompletionResponseStreamChoice, ChatCompletionStreamResponse,
                                             ChatMessage, DeltaMessage, DeltaToolCall, UsageInfo)
-from lmdeploy.serve.openai.reasoning_parser.qwen_qwq_reasoning_parser import QwenQwQReasoningParser
-from lmdeploy.serve.openai.tool_parser.qwen3_parser import Qwen3ToolParser
+from lmdeploy.serve.openai.reasoning_parser.qwen_reasoning_parser import QwenQwQReasoningParser
+from lmdeploy.serve.openai.tool_parser.qwen3_tool_parser import Qwen3ToolParser
```
This test file still calls `extract_tool_calls_streaming(...)` / `extract_reasoning_content_streaming(...)` with the removed `previous_text`/`current_text` and `previous_token_ids`/`current_token_ids` parameters (see `_chat_completion_v1` below). Update the test harness to match the new parser API (and to update/step the shared `StreamingParserState` via `get_streaming_state(request)` similarly to api_server.py).
```diff
@@ -10,7 +10,7 @@
 from lmdeploy.serve.openai.protocol import (ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseChoice,
                                             ChatCompletionResponseStreamChoice, ChatCompletionStreamResponse,
                                             ChatMessage, DeltaMessage, DeltaToolCall, UsageInfo)
-from lmdeploy.serve.openai.tool_parser.qwen3coder_parser import Qwen3CoderToolParser
+from lmdeploy.serve.openai.tool_parser.qwen3coder_tool_parser import Qwen3CoderToolParser
```
This test file still calls `extract_tool_calls_streaming(...)` / `extract_reasoning_content_streaming(...)` using the old signature with `previous_text`/`current_text` and token id args (see `_chat_completion_v1` below). Update the test harness to the new streaming parser API and maintain the shared `StreamingParserState` on the request (as api_server.py now does).
```python
previous_text: str = ''
current_text: str = ''
previous_token_ids: list[int] = []
current_token_ids: list[int] = []
```
`StreamingParserState` uses mutable list defaults (`previous_token_ids` / `current_token_ids`). In a dataclass these lists are shared across all instances, which can leak token ids between requests. Use `field(default_factory=list)` for these fields.
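A small sketch of the suggested fix. Note that CPython's `dataclasses` actually rejects a bare mutable list default with a `ValueError` at class-definition time, so the shared-list symptom depends on how the state class is really declared; either way, `field(default_factory=list)` gives each instance its own fresh list:

```python
from dataclasses import dataclass, field


@dataclass
class SafeState:
    """Each instance gets independent lists via default_factory."""
    previous_token_ids: list[int] = field(default_factory=list)
    current_token_ids: list[int] = field(default_factory=list)


a, b = SafeState(), SafeState()
a.current_token_ids.append(1)
print(b.current_token_ids)  # b is unaffected: []
```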
```python
def step(self) -> None:
    """Advance: copy current -> previous (call at end of each iteration)."""
    self.previous_text = self.current_text
    self.previous_token_ids = self.current_token_ids
```
`StreamingParserState.step()` assigns `previous_token_ids = current_token_ids`, which aliases the same list. Because `update()` mutates `current_token_ids` in place, `previous_token_ids` will also change, breaking any logic that compares previous vs current tokens (and can cause incorrect detection of start/end tags). Snapshot via `.copy()` (or store lengths / use an immutable sequence).
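The aliasing hazard can be reproduced in isolation; `DemoState`, `step_buggy`, and `step_fixed` below are illustrative names, not the PR's code:

```python
class DemoState:
    """Minimal reproduction of the list-aliasing bug flagged above."""

    def __init__(self) -> None:
        self.previous_token_ids: list[int] = []
        self.current_token_ids: list[int] = []

    def step_buggy(self) -> None:
        # Both names now point at the SAME list object.
        self.previous_token_ids = self.current_token_ids

    def step_fixed(self) -> None:
        # Snapshot: previous keeps its own copy of the tokens.
        self.previous_token_ids = self.current_token_ids.copy()


s = DemoState()
s.current_token_ids.extend([1, 2])
s.step_buggy()
s.current_token_ids.append(3)          # in-place update mutates "previous" too
assert s.previous_token_ids == [1, 2, 3]

s2 = DemoState()
s2.current_token_ids.extend([1, 2])
s2.step_fixed()
s2.current_token_ids.append(3)         # snapshot is unaffected
assert s2.previous_token_ids == [1, 2]
```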
```diff
 def extract_reasoning_content(self, model_output: str, request: ChatCompletionRequest, **kwargs) -> tuple[str, str]:
     """Extract reasoning content from a complete model-generated string.

     Used for non-streaming responses where we have the entire model response
     available before sending to the client.

     Args:
-        model_output (str): The model-generated string to extract reasoning content from.
-        request (ChatCompletionRequest): he request object that was used to generate the model_output.
+        model_output: The model-generated string to extract reasoning content from.
+        request: The request object that was used to generate the model_output.

     Returns:
-        reasoning_content (str | None): The reasoning content.
-        final_output (str | None): The content.
+        A tuple of (reasoning_content, final_output). Either may be None.
     """
```
The return type annotation for `extract_reasoning_content` is `tuple[str, str]`, but the docstring and implementations return `None` for either element (e.g., `ThinkingReasoningParser.extract_reasoning_content` returns `(None, model_output)` and vice versa). Update the annotation to allow `None` (e.g., `tuple[str | None, str | None]`) to match behavior.
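An illustration of why both tuple slots must admit `None`; the body is a hypothetical minimal extractor, not the PR's implementation:

```python
from typing import Optional


def extract_reasoning_content(model_output: str) -> tuple[Optional[str], Optional[str]]:
    """Hypothetical <think>...</think> extractor: the no-tag path returns
    (None, model_output) and an all-reasoning output returns (reasoning, None),
    so tuple[str, str] would misdescribe both paths."""
    start, end = '<think>', '</think>'
    if start not in model_output:
        return None, model_output            # no reasoning present
    after = model_output.partition(start)[2]
    reasoning, _, rest = after.partition(end)
    return reasoning, rest or None           # no final text after the tag
```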
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist