@jackyYang6 jackyYang6 commented Feb 5, 2026

Motivation

Introduce the ThinkingBudget logits processor to limit the length of the <think> segment, and add Chinese and English usage documentation and unit tests.

Modifications

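  • Add ThinkingBudgetLogitsProcessor and register it as a built-in logits processor to control thinking length.
  • Precompute the prompt's <think> state in the DataProcessor stage, so the first decoding step does not have to scan the prompt on the GPU.
  • Wire the parameters through the Engine/SamplingParams pass-through path.
  • Add unit tests for thinking budget behavior.
  • Add Chinese and English feature documentation and update the mkdocs navigation.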
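To make the idea of a thinking budget concrete, here is a minimal sketch of how such a limit could be enforced on the logits. It is not the code from this PR: the `ThinkingState` layout, the function names, and `think_end_id` are assumptions made for illustration, and the sketch omits the forced line break and the optional stop sentence.

```python
from dataclasses import dataclass
from typing import List

NEG_INF = float("-inf")


@dataclass
class ThinkingState:
    """Per-request state (hypothetical layout, for illustration only)."""
    budget: int                # max tokens allowed inside <think>...</think>
    in_thinking: bool = False  # True after <think> and before </think>
    used: int = 0              # thinking tokens generated so far
    ended: bool = False        # True once </think> has been produced


def update_state(state: ThinkingState, last_token: int,
                 think_start_id: int, think_end_id: int) -> None:
    """Advance the state with the token sampled in the previous step."""
    if state.ended:
        return
    if last_token == think_start_id:
        state.in_thinking = True
    elif last_token == think_end_id:
        state.in_thinking = False
        state.ended = True
    elif state.in_thinking:
        state.used += 1


def apply_budget(logits: List[float], state: ThinkingState,
                 think_end_id: int) -> List[float]:
    """Once the budget is used up, mask every token except </think>."""
    if state.in_thinking and state.used >= state.budget:
        return [0.0 if i == think_end_id else NEG_INF
                for i in range(len(logits))]
    return logits
```

With a budget of 2 and two thinking tokens already generated, `apply_budget` leaves only the `</think>` id with a finite score, so any sampler must pick it on the next step.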

Usage or Command

Start the service:

python -m fastdeploy.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --logits-processors ThinkingBudgetLogitsProcessor

Example request:

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "你好!"}],
    "max_completion_tokens": 30,
    "logits_processors_args": {
      "thinking_budget": 20,
      "think_stop_sentence": "思考已达上限,开始回复"
    }
  }'
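The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl payload above; the helper name `build_chat_request` is hypothetical, and the host and port are the ones from the launch command.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, prompt: str, thinking_budget: int,
                       think_stop_sentence: Optional[str] = None,
                       max_completion_tokens: int = 30) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with a thinking budget."""
    processor_args = {"thinking_budget": thinking_budget}
    if think_stop_sentence is not None:
        processor_args["think_stop_sentence"] = think_stop_sentence
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_completion_tokens,
        "logits_processors_args": processor_args,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending requires a running server, for example:
#   req = build_chat_request("http://0.0.0.0:8180", "你好!", 20)
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it keeps the payload easy to inspect or log before it reaches the server.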

Benchmark

Test environment:

  • GPU: A30 24GB (single card)
  • Model: Qwen3-0.6B
  • Dataset: filtered_sharedgpt_2000_input_1136_output_200_fd.json
  • Baseline: --logits-processors not enabled
  • ThinkingBudget: ThinkingBudgetLogitsProcessor enabled (thinking_budget=20000, so the budget logic is never triggered)

Concurrency 64 / 500 requests

| Metric | Baseline | ThinkingBudget | Change |
|---|---|---|---|
| Output throughput (tok/s) | 1947.53 | 1737.76 | -10.77% |
| Mean Decode (tok/s) | 45.64 | 44.50 | -2.50% |
| Mean TTFT (ms) | 279.41 | 286.99 | +2.71% |
| Mean TPOT (ms) | 21.99 | 22.56 | +2.59% |

Concurrency 1 / 50 requests

| Metric | Baseline | ThinkingBudget | Change |
|---|---|---|---|
| Output throughput (tok/s) | 212.18 | 202.43 | -4.59% |
| Mean Decode (tok/s) | 216.38 | 205.87 | -4.86% |
| Mean TTFT (ms) | 53.85 | 55.37 | +2.82% |
| Mean TPOT (ms) | 4.63 | 4.86 | +4.97% |

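Conclusion

  • The latest single-concurrency results show TTFT regressing by about 2.8% and TPOT/ITL by about 5%, which is acceptable but leaves room for optimization.
  • Under high concurrency, throughput still regresses noticeably (Output/Total tok/s about -11% to -13%); this is the main performance cost of the current approach.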

Accuracy Tests

No kernel or model forward logic is changed, so no accuracy regression testing is needed for now.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Feb 5, 2026

Thanks for your contribution!

codecov-commenter commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 84.73118% with 71 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fd56d85). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...model_executor/logits_processor/thinking_budget.py | 84.61% | 14 Missing and 22 partials ⚠️ |
| fastdeploy/input/v1/text_processor.py | 81.52% | 8 Missing and 9 partials ⚠️ |
| fastdeploy/engine/common_engine.py | 65.00% | 2 Missing and 5 partials ⚠️ |
| fastdeploy/engine/engine.py | 65.00% | 2 Missing and 5 partials ⚠️ |
| fastdeploy/input/text_processor.py | 95.83% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6367   +/-   ##
==========================================
  Coverage           ?   68.18%           
==========================================
  Files              ?      392           
  Lines              ?    52907           
  Branches           ?     8300           
==========================================
  Hits               ?    36076           
  Misses             ?    14179           
  Partials           ?     2652           
| Flag | Coverage Δ |
|---|---|
| GPU | 68.18% <84.73%> (?) |

Copilot AI left a comment

Pull request overview

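This PR introduces ThinkingBudgetLogitsProcessor, which imposes a hard limit on the generated length of the <think>...</think> thinking segment during decoding. It also wires the related parameters through DataProcessor → Engine → Worker → ModelConfig, and adds Chinese and English documentation and unit tests.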

Changes:

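  • Add ThinkingBudgetLogitsProcessor and register it as a built-in logits processor; once the budget is triggered, it forces a line break and </think> (optionally forcing a stop sentence).
  • On the text_processor / v1/text_processor side, precompute the prompt's <think> state and add the logic that converts think_stop_sentence into token ids.
  • When the Engine launches workers, fill in the retrieval and pass-through of think_start_id / line_break_id; also add documentation and extensive unit test coverage.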
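The prompt-state precomputation mentioned above can be pictured as a single CPU pass over the prompt token ids. The sketch below is an assumption about what such a pass might compute, not the code in text_processor.py; the function name, the returned keys, and `think_end_id` are hypothetical.

```python
from typing import Dict, List, Union


def precompute_think_state(prompt_ids: List[int], think_start_id: int,
                           think_end_id: int) -> Dict[str, Union[bool, int]]:
    """Scan the prompt once and summarise the status of its last <think> block."""
    started = False
    ended = False
    tokens_in_think = 0
    for tok in prompt_ids:
        if tok == think_start_id:
            # A new <think> block opens: reset the counters for this block.
            started, ended, tokens_in_think = True, False, 0
        elif tok == think_end_id and started:
            ended = True
        elif started and not ended:
            tokens_in_think += 1
    return {"started": started, "ended": ended,
            "tokens_in_think": tokens_in_think}
```

A processor that receives this summary with the request can start from the right state on the first decoding step instead of scanning the prompt itself.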

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
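| File | Description |
|---|---|
| tests/model_executor/test_thinking_budget.py | Unit tests covering the key thinking budget state machine and the related DataProcessor/Engine branches |
| mkdocs.yml | Adds the Thinking Budget docs to the mkdocs navigation |
| fastdeploy/worker/worker_process.py | Adds --think_start_id to worker argument parsing |
| fastdeploy/model_executor/logits_processor/thinking_budget.py | Core implementation of the new ThinkingBudgetLogitsProcessor |
| fastdeploy/model_executor/logits_processor/__init__.py | Registers and exports ThinkingBudgetLogitsProcessor |
| fastdeploy/input/v1/text_processor.py | v1 side: adds prompt thinking-state precomputation and stop sentence tokenization (plus a new encoding cache) |
| fastdeploy/input/text_processor.py | Non-v1 side: adds the same capability (plus a new encoding cache and a get_eos_token_id import fallback) |
| fastdeploy/engine/engine.py | Retrieves and passes through think_start_id; fills in the line_break_id retrieval logic |
| fastdeploy/engine/common_engine.py | Same worker-launch pass-through and line_break_id handling as engine.py |
| fastdeploy/config.py | ModelConfig gains a think_start_id field to receive the worker argument |
| docs/zh/features/thinking_budget.md | Chinese feature documentation |
| docs/features/thinking_budget.md | English feature documentation |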

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>