
docs: add Qwen3.5 deployment cookbook (EN/CN) #1248

Open
sufubao wants to merge 4 commits into main from qw35_cookbook

Conversation

Collaborator

@sufubao sufubao commented Apr 1, 2026

Summary

  • Add bilingual (English + Chinese) Qwen3.5 deployment cookbook covering model variants (qwen3_5, qwen3_5_moe)
  • Include launch scripts for text-only dense, multimodal, MoE, and high-performance H200 configurations
  • Document thinking/reasoning mode (--reasoning_parser qwen3), FP8 KV quantization, multimodal image input, and hardware requirements
  • Register new cookbooks in both EN and CN documentation index

Test plan

  • Verify RST renders correctly with sphinx-build
  • Confirm all launch command parameters are valid against api_server args
  • Test example curl commands against a running Qwen3.5 instance
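The last step of the test plan could be exercised with a minimal smoke test like the sketch below. The endpoint path, port, and request schema are assumptions for illustration, not taken from this PR; check the LightLLM api_server documentation for the exact request format.

```bash
# Hypothetical smoke test against a locally running Qwen3.5 instance.
# Endpoint path and parameter names are assumptions; verify them against
# the LightLLM api_server docs before use.
curl -s http://localhost:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 64}
      }'
```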

Add bilingual cookbook covering text-only, multimodal, MoE, and
high-performance launch configurations, thinking/reasoning mode,
FP8 KV quantization, and hardware recommendations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces deployment guides for the Qwen3.5 model family, including dense and Mixture-of-Experts (MoE) variants, in both Chinese and English. The documentation covers model features, recommended launch scripts for different hardware setups, reasoning mode support, and KV cache quantization. Feedback from the review points out that the model types qwen3_5_text and qwen3_5_moe_text are not registered in the codebase and should be replaced with base model names. Additionally, the review suggests correcting the KV cache quantization parameter from --data_type to --kv_cache_quant_type and clarifying the usage of calibration configurations.

Comment on lines +36 to +41
* - ``qwen3_5_text``
  - Dense + text-only
  - Dense MLP, no vision encoder
* - ``qwen3_5_moe_text``
  - MoE + text-only
  - Mixture-of-Experts, no vision encoder
Contributor


medium

The model types qwen3_5_text and qwen3_5_moe_text are not registered in the current codebase (see lightllm/models/qwen3_5/model.py and lightllm/models/qwen3_5_moe/model.py). Passing these names as the --model_type argument in the launch commands will result in an error. It is recommended to clarify in the doc that for text-only mode the base model names (qwen3_5, qwen3_5_moe) should be used, without enabling the --enable_multimodal flag.


.. code-block:: bash

   --data_type fp8_e4m3
Contributor


medium

In LightLLM, the parameter that enables KV cache quantization is typically --kv_cache_quant_type, while --data_type is generally used to specify the precision of model weights and activations (for example, loading the whole model in FP8). Since this section is specifically about KV cache quantization, --kv_cache_quant_type should be used. In addition, the JSON calibration configuration mentioned in the doc should explain how it is passed to the server (for example, via the --quant_config_path argument).

Suggested change
--data_type fp8_e4m3
--kv_cache_quant_type fp8_e4m3

Comment on lines +36 to +41
* - ``qwen3_5_text``
- Dense + Text-only
- Dense MLP without vision encoder
* - ``qwen3_5_moe_text``
- MoE + Text-only
- Mixture-of-Experts without vision encoder
Contributor


medium

The model types qwen3_5_text and qwen3_5_moe_text are not registered in the current codebase (see lightllm/models/qwen3_5/model.py and lightllm/models/qwen3_5_moe/model.py). Using these names in the --model_type argument will result in an error. It is recommended to clarify that for text-only mode, the base model names (qwen3_5, qwen3_5_moe) should be used without the --enable_multimodal flag.


.. code-block:: bash

   --data_type fp8_e4m3
Contributor


medium

In LightLLM, the parameter for enabling KV cache quantization is typically --kv_cache_quant_type. The --data_type flag is generally used to specify the precision of the model weights and activations. Since this section is specifically about KV cache quantization, --kv_cache_quant_type should be used instead. Also, please clarify how the calibration JSON configuration should be passed to the server (e.g., via --quant_config_path).

Suggested change
--data_type fp8_e4m3
--kv_cache_quant_type fp8_e4m3
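Putting the reviewer's suggestion together, a launch with FP8 KV cache quantization might look like the following sketch. The model path and port are placeholders, and the flag names follow the review comment rather than a verified invocation.

```bash
# Hypothetical launch command: --kv_cache_quant_type follows the
# reviewer's suggestion; this is a sketch, not a tested configuration.
python -m lightllm.server.api_server \
  --model_dir /path/to/Qwen3.5 \
  --port 8000 \
  --kv_cache_quant_type fp8_e4m3
```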

sufubao and others added 3 commits April 1, 2026 06:21
Remove qwen3_5_text and qwen3_5_moe_text from the supported model
types table since they are not registered via @ModelRegistry. Clarify
that text-only mode uses the same model type without --enable_multimodal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…okbook

Qwen3.5 models are registered as multimodal by default, so
--enable_multimodal is not a user-facing CLI flag. For text-only
deployment, use --disable_vision instead. For multimodal deployment,
no extra flag is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
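Under the convention described in this commit, the two deployment modes would differ only in a single flag. The sketch below assumes the --disable_vision flag as stated above; model paths are placeholders.

```bash
# Text-only deployment: disable the vision tower (per this commit).
python -m lightllm.server.api_server --model_dir /path/to/Qwen3.5 --disable_vision

# Multimodal deployment: no extra flag needed; multimodal is the default.
python -m lightllm.server.api_server --model_dir /path/to/Qwen3.5
```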
Rewrite both EN/CN cookbooks to use the real model Qwen3.5-397B-A17B
(397B total / 17B active MoE) instead of fictional model names like
Qwen3.5-VL or Qwen3.5-MoE. Add HuggingFace link, accurate architecture
details (512 experts, 60-layer hybrid layout), recommended sampling
parameters for thinking/non-thinking modes, and proper 8×H200 setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
