diff --git a/docs/CN/source/cookbook/qwen35_deployment.rst b/docs/CN/source/cookbook/qwen35_deployment.rst new file mode 100644 index 0000000000..4cb6bf93e4 --- /dev/null +++ b/docs/CN/source/cookbook/qwen35_deployment.rst @@ -0,0 +1,223 @@ +.. _qwen35_deployment: + +Qwen3.5 模型部署指南 +===================== + +LightLLM 支持 Qwen3.5 模型系列的部署。本指南以 `Qwen3.5-397B-A17B `_ 为例,介绍部署配置、思考/推理模式、多模态输入及推荐启动参数。 + +模型概述 +-------- + +Qwen3.5-397B-A17B 是一个多模态混合专家模型,总参数量 397B,每个 token 激活 17B 参数。原生支持文本、图像和视频理解。 + +**主要特性:** + +- **混合注意力架构**:60 层排列为 15 个重复组 ``[3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)]``,交替使用线性注意力与全注意力(通过 ``full_attention_interval`` 控制) +- **稀疏 MoE**:共 512 个专家,每个 token 激活 10 个路由专家 + 1 个共享专家 +- **原生多模态**:内置视觉编码器,支持图像和视频理解,无需单独的 "-VL" 变体 +- **长上下文**:原生支持 262K 上下文,通过 YaRN 缩放可扩展至 1M+ tokens +- **多头旋转位置编码(MRoPE)**:交错旋转位置编码,``mrope_section=[11, 11, 10]``,用于空间/时间定位 +- **思考/推理模式**:支持 ``qwen3`` 推理解析器,使用 ``...`` 标签(默认启用) + +**已注册的模型类型:** + +.. list-table:: + :header-rows: 1 + :widths: 30 30 40 + + * - 模型类型 + - 架构 + - 说明 + * - ``qwen3_5`` + - 稠密 + 多模态 + - 稠密 MLP,带视觉编码器 + * - ``qwen3_5_moe`` + - MoE + 多模态 + - 混合专家模型,带视觉编码器 + +.. note:: + + Qwen3.5 模型默认注册为多模态模型,多模态支持自动启用。若需纯文本部署,添加 ``--disable_vision`` 以跳过视觉编码器的加载,减少显存占用和启动时间。 + +推荐启动脚本 +-------------- + +Qwen3.5-397B-A17B(8×H200) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +在 8 张 GPU 上部署完整的多模态 MoE 模型: + +.. code-block:: bash + + LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \ + python -m lightllm.server.api_server \ + --model_dir /path/to/Qwen3.5-397B-A17B/ \ + --tp 8 \ + --max_req_total_len 262144 \ + --chunked_prefill_size 8192 \ + --llm_prefill_att_backend fa3 \ + --llm_decode_att_backend flashinfer \ + --graph_max_batch_size 128 \ + --reasoning_parser qwen3 \ + --host 0.0.0.0 \ + --port 8000 + +**参数说明:** + +- ``LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1``: 启用 Triton 自动调优以获得最佳内核性能 +- ``LOADWORKER=18``: 模型加载线程数,加快权重加载速度 +- ``--tp 8``: 8 卡张量并行(397B 参数模型必需) +- ``--max_req_total_len 262144``: 最大请求总长度,与模型原生 262K 上下文匹配 +- ``--chunked_prefill_size 8192``: 预填充处理的分块大小,降低峰值显存占用 +- ``--llm_prefill_att_backend fa3``: 预填充阶段使用 FlashAttention3(推荐 H200) +- ``--llm_decode_att_backend flashinfer``: 解码阶段使用 FlashInfer +- ``--graph_max_batch_size 128``: CUDA graph 最大批处理大小(显存不足时可减小) +- ``--reasoning_parser qwen3``: 启用 Qwen3 推理解析器,支持思考模式 + +纯文本模式(节省显存) +~~~~~~~~~~~~~~~~~~~~~~~ + +跳过视觉编码器加载以减少显存占用: + +.. code-block:: bash + + LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \ + python -m lightllm.server.api_server \ + --model_dir /path/to/Qwen3.5-397B-A17B/ \ + --tp 8 \ + --max_req_total_len 262144 \ + --chunked_prefill_size 8192 \ + --llm_prefill_att_backend fa3 \ + --llm_decode_att_backend flashinfer \ + --graph_max_batch_size 128 \ + --reasoning_parser qwen3 \ + --disable_vision \ + --host 0.0.0.0 \ + --port 8000 + +唯一区别是 ``--disable_vision``,阻止加载视觉编码器。此模式下模型仅接受文本输入。 + +思考/推理模式 +------------- + +Qwen3.5 默认启用思考模式。模型在生成最终答案之前,会在 ``...`` 标签内生成思维链推理过程。 + +**启用推理模式:** + +在启动命令中添加 ``--reasoning_parser qwen3``(以上所有示例均已包含)。使用 OpenAI 兼容 API 时,在请求中设置 ``separate_reasoning: true`` 可单独获取思考内容: + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [{"role": "user", "content": "请逐步求解:23 * 47 等于多少?"}], + "max_tokens": 500, + "separate_reasoning": true + }' + +响应中将包含 ``reasoning_content`` 字段(模型思考过程)和 ``content`` 字段(最终答案)。 + +**针对特定请求禁用思考:** + +若需要更快的响应速度,可在请求中设置 ``enable_thinking: false`` 以使用非思考模式: + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [{"role": "user", "content": "你好"}], + "max_tokens": 100, + "enable_thinking": false + }' + +**推荐采样参数:** + +.. list-table:: + :header-rows: 1 + :widths: 30 35 35 + + * - 参数 + - 思考模式 + - 非思考模式 + * - temperature + - 0.6 + - 0.7 + * - top_p + - 0.95 + - 0.8 + * - top_k + - 20 + - 20 + * - presence_penalty + - 0.0 + - 1.5 + +测试与验证 +---------- + +基础功能测试 +~~~~~~~~~~~~ + +.. code-block:: bash + + curl http://localhost:8000/generate \ + -H "Content-Type: application/json" \ + -d '{ + "inputs": "什么是人工智能?", + "parameters":{ + "max_new_tokens": 100, + "frequency_penalty": 1 + } + }' + +OpenAI 兼容聊天接口 +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [{"role": "user", "content": "你好"}], + "max_tokens": 100, + "temperature": 0.7, + "top_p": 0.8, + "enable_thinking": false + }' + +多模态测试(图像输入) +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [ + { + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, + {"type": "text", "text": "请描述这张图片。"} + ] + } + ], + "max_tokens": 200 + }' + +硬件要求 +-------- + +**Qwen3.5-397B-A17B:** + +- 总参数量 397B,每个 token 激活 17B(512 个专家,10 路由 + 1 共享) +- **最低要求**:8× NVIDIA H100/H200 GPU(每卡 80GB HBM),需 NVLink 互联 +- 必须使用 ``--tp 8`` 以将模型权重分布到各 GPU +- 如遇到显存不足,可减小 ``--max_req_total_len`` 或 ``--graph_max_batch_size`` +- 使用 ``--data_type fp8_e4m3`` 进行 FP8 KV 量化可进一步降低显存压力 diff --git a/docs/CN/source/index.rst b/docs/CN/source/index.rst index 06f694127a..8f79e5126f 100755 --- a/docs/CN/source/index.rst +++ b/docs/CN/source/index.rst @@ -64,6 +64,7 @@ Lightllm 整合了众多的开源方案的优点,包括但不限于 FasterTran :caption: Cookbook GLM-4.7-Flash 部署 + Qwen3.5 部署 .. toctree:: :maxdepth: 1 diff --git a/docs/EN/source/cookbook/qwen35_deployment.rst b/docs/EN/source/cookbook/qwen35_deployment.rst new file mode 100644 index 0000000000..6b3b56252d --- /dev/null +++ b/docs/EN/source/cookbook/qwen35_deployment.rst @@ -0,0 +1,224 @@ +.. _qwen35_deployment: + +Qwen3.5 Model Deployment Guide +=============================== + +LightLLM supports deployment of the Qwen3.5 model family. This guide uses `Qwen3.5-397B-A17B `_ as an example, covering deployment configuration, thinking/reasoning mode, multimodal input, and recommended launch parameters. + +Model Overview +-------------- + +Qwen3.5-397B-A17B is a multimodal Mixture-of-Experts model with 397B total parameters and 17B active parameters per token. It natively supports text, image, and video understanding. + +**Key Features:** + +- **Hybrid Attention Architecture**: 60 layers arranged as 15 repeating groups of ``[3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)]``, alternating between linear attention and full attention (controlled by ``full_attention_interval``) +- **Sparse MoE**: 512 total experts, 10 routed + 1 shared expert activated per token +- **Native Multimodal**: Built-in vision encoder for image and video understanding — no separate "-VL" variant needed +- **Long Context**: 262K native context, extensible to 1M+ tokens with YaRN scaling +- **Multi-head RoPE (MRoPE)**: Interleaved rotary position embeddings with ``mrope_section=[11, 11, 10]`` for spatial/temporal positioning +- **Thinking/Reasoning Mode**: Supports ``qwen3`` reasoning parser with ``...`` tags (enabled by default) + +**Registered Model Types:** + +.. list-table:: + :header-rows: 1 + :widths: 30 30 40 + + * - Model Type + - Architecture + - Description + * - ``qwen3_5`` + - Dense + Multimodal + - Dense MLP with vision encoder + * - ``qwen3_5_moe`` + - MoE + Multimodal + - Mixture-of-Experts with vision encoder + +.. note:: + + Qwen3.5 models are registered as multimodal by default. Multimodal support is automatically enabled unless explicitly disabled. For text-only deployment, add ``--disable_vision`` to skip loading the vision encoder, which reduces memory usage and startup time. + +Recommended Launch Scripts +-------------------------- + +Qwen3.5-397B-A17B (8×H200) +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Deploy the full multimodal MoE model on 8 GPUs: + +.. code-block:: bash + + LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \ + python -m lightllm.server.api_server \ + --model_dir /path/to/Qwen3.5-397B-A17B/ \ + --tp 8 \ + --max_req_total_len 262144 \ + --chunked_prefill_size 8192 \ + --llm_prefill_att_backend fa3 \ + --llm_decode_att_backend flashinfer \ + --graph_max_batch_size 128 \ + --reasoning_parser qwen3 \ + --host 0.0.0.0 \ + --port 8000 + +**Parameter Description:** + +- ``LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1``: Enable Triton autotuning for optimal kernel performance +- ``LOADWORKER=18``: Number of model loading threads for faster weight loading +- ``--tp 8``: Tensor parallelism across 8 GPUs (required for 397B parameter model) +- ``--max_req_total_len 262144``: Maximum total request length matching the model's native 262K context +- ``--chunked_prefill_size 8192``: Chunk size for prefill processing, reduces peak memory usage +- ``--llm_prefill_att_backend fa3``: Use FlashAttention3 for prefill (recommended for H200) +- ``--llm_decode_att_backend flashinfer``: Use FlashInfer for decode phase +- ``--graph_max_batch_size 128``: Maximum batch size for CUDA graph optimization (reduce if OOM) +- ``--reasoning_parser qwen3``: Enable Qwen3 reasoning parser for thinking mode + +Text-only Mode (Save Memory) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To skip loading the vision encoder and reduce memory usage: + +.. code-block:: bash + + LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \ + python -m lightllm.server.api_server \ + --model_dir /path/to/Qwen3.5-397B-A17B/ \ + --tp 8 \ + --max_req_total_len 262144 \ + --chunked_prefill_size 8192 \ + --llm_prefill_att_backend fa3 \ + --llm_decode_att_backend flashinfer \ + --graph_max_batch_size 128 \ + --reasoning_parser qwen3 \ + --disable_vision \ + --host 0.0.0.0 \ + --port 8000 + +The only difference is ``--disable_vision``, which prevents the vision encoder from being loaded. The model will only accept text input in this mode. + +Thinking/Reasoning Mode +----------------------- + +Qwen3.5 has thinking mode enabled by default. The model generates chain-of-thought reasoning inside ``...`` tags before producing the final answer. + +**Enabling Reasoning Mode:** + +Add ``--reasoning_parser qwen3`` to your launch command (included in all examples above). When using the OpenAI-compatible API, set ``separate_reasoning: true`` in the request to receive thinking content separately: + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [{"role": "user", "content": "Solve step by step: what is 23 * 47?"}], + "max_tokens": 500, + "separate_reasoning": true + }' + +The response will include a ``reasoning_content`` field with the model's thinking process and a ``content`` field with the final answer. + +**Disabling Thinking for Specific Requests:** + +To use the model in non-thinking mode for faster responses, set ``enable_thinking: false`` in the request: + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [{"role": "user", "content": "Hello"}], + "max_tokens": 100, + "enable_thinking": false + }' + +**Recommended Sampling Parameters:** + +.. list-table:: + :header-rows: 1 + :widths: 30 35 35 + + * - Parameter + - Thinking Mode + - Non-Thinking Mode + * - temperature + - 0.6 + - 0.7 + * - top_p + - 0.95 + - 0.8 + * - top_k + - 20 + - 20 + * - presence_penalty + - 0.0 + - 1.5 + + +Testing and Validation +---------------------- + +Basic Functionality Testing +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + curl http://localhost:8000/generate \ + -H "Content-Type: application/json" \ + -d '{ + "inputs": "What is AI?", + "parameters":{ + "max_new_tokens": 100, + "frequency_penalty": 1 + } + }' + +OpenAI-Compatible Chat Completions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [{"role": "user", "content": "Hello"}], + "max_tokens": 100, + "temperature": 0.7, + "top_p": 0.8, + "enable_thinking": false + }' + +Multimodal Testing (Image Input) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen3.5-397B-A17B", + "messages": [ + { + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, + {"type": "text", "text": "Describe this image."} + ] + } + ], + "max_tokens": 200 + }' + +Hardware Requirements +--------------------- + +**Qwen3.5-397B-A17B:** + +- 397B total parameters, 17B activated per token (512 experts, 10 routed + 1 shared) +- **Minimum**: 8× NVIDIA H100/H200 GPUs (80GB HBM each) with NVLink interconnect +- ``--tp 8`` required to fit model weights across GPUs +- Reduce ``--max_req_total_len`` or ``--graph_max_batch_size`` if encountering OOM errors +- Use ``--data_type fp8_e4m3`` for FP8 KV quantization to further reduce memory pressure diff --git a/docs/EN/source/index.rst b/docs/EN/source/index.rst index f2cfb4a8c8..808f432892 100755 --- a/docs/EN/source/index.rst +++ b/docs/EN/source/index.rst @@ -63,6 +63,7 @@ Documentation List :caption: Cookbook GLM-4.7-Flash Deployment + Qwen3.5 Deployment .. toctree:: :maxdepth: 1