feat: implement Turbomind vision encoder support for Qwen3VL/3.5 families (#4460)
lapy wants to merge 1 commit into InternLM:main
Conversation
@lapy you are on fire!
Pull request overview
This PR adds TurboMind vision-encoder support for the Qwen3-VL and Qwen3.5 VL families by implementing split-vision loading/forwarding, updating model/arch routing, and extending export support for VL checkpoint key layouts.
Changes:
- Implement TurboMind split-vision loading, vision forward, and mRoPE metadata packing for the Qwen3-VL family (and reuse it for Qwen3.5).
- Update task routing (`get_task`) and its call sites to use backend-config-aware VL/LLM engine selection, including a new `disable_vision_encoder` flag on `TurbomindEngineConfig`.
- Extend TurboMind model/export support to recognize Qwen3-VL architectures and VL checkpoints whose text weights live under `model.language_model.*`, plus add regression tests.
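The routing change described above can be sketched as follows. This is a simplified illustration, not the actual lmdeploy implementation: the real `get_task` also returns an engine class, and the config stand-in below shows only the one field relevant here.

```python
from dataclasses import dataclass


@dataclass
class TurbomindEngineConfig:
    """Simplified stand-in; the real config has many more options."""
    disable_vision_encoder: bool = False


# Illustrative subset of architectures treated as vision-language models.
VL_ARCHS = {'Qwen3VLForConditionalGeneration', 'Qwen3VLMoeForConditionalGeneration'}


def get_task(arch: str, backend_config=None) -> str:
    """Pick the task type, honoring disable_vision_encoder for VL archs."""
    if arch in VL_ARCHS:
        if backend_config is not None and backend_config.disable_vision_encoder:
            return 'llm'  # route a VL model to the text-only engine
        return 'vlm'
    return 'llm'
```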
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lmdeploy/vl/model/qwen3.py` | Adds dependency checks, HF arch→AutoModel mapping, split-vision loader, TurboMind forward/to_turbomind with mRoPE meta. |
| `lmdeploy/vl/model/qwen3_5.py` | Reuses the Qwen3-VL TurboMind vision path for Qwen3.5 and aligns preprocessor dependency checks. |
| `lmdeploy/archs.py` | Changes the `get_task` signature and adds `disable_vision_encoder` routing behavior. |
| `lmdeploy/pipeline.py` | Updates the `get_task` call to the new signature. |
| `lmdeploy/serve/openai/api_server.py` | Updates the `get_task` call to the new signature. |
| `lmdeploy/lite/apis/calibrate.py` | Updates the `get_task` call to the new signature. |
| `lmdeploy/messages.py` | Adds `disable_vision_encoder` to `TurbomindEngineConfig`. |
| `lmdeploy/cli/serve.py` | Wires the CLI flag through to `TurbomindEngineConfig`. |
| `lmdeploy/turbomind/supported_models.py` | Marks Qwen3-VL architectures as TurboMind-supported. |
| `lmdeploy/turbomind/deploy/source_model/qwen.py` | Supports nested `model.language_model.*` prefixes and layer-pattern matching for VL checkpoints. |
| `tests/test_lmdeploy/test_vl/test_qwen_vl_family.py` | Adds unit tests for arch resolution, task routing, and Qwen3-VL TurboMind packing behavior. |
| `tests/test_lmdeploy/test_pytorch/test_engine_disable_vision.py` | Adds a regression test for the expected behavior when vision is disabled in the PyTorch engine. |
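The nested-prefix support in `source_model/qwen.py` amounts to key normalization. A minimal sketch of the idea, with a hypothetical helper name (the real exporter uses layer-pattern matching rather than a plain string remap):

```python
def remap_language_model_keys(state_dict: dict) -> dict:
    """Normalize VL checkpoint keys whose text weights live under
    `model.language_model.*` so a text-only export path can consume them."""
    prefix = 'model.language_model.'
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            # e.g. model.language_model.layers.0.attn.w -> model.layers.0.attn.w
            remapped['model.' + key[len(prefix):]] = value
        else:
            remapped[key] = value  # vision and other keys pass through unchanged
    return remapped
```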
```diff
 try:
-    from transformers import Qwen3VLForConditionalGeneration, Qwen3VLMoeForConditionalGeneration  # noqa: F401
+    from transformers import (  # noqa: F401
+        Qwen3_5ForConditionalGeneration,
+        Qwen3_5MoeForConditionalGeneration,
+        Qwen3VLForConditionalGeneration,
+        Qwen3VLMoeForConditionalGeneration,
+    )
```
`check_qwen3_vl_deps_install()` currently requires both Qwen3-VL and Qwen3.5 classes to be importable from `transformers`. This can regress Qwen3-VL usage in environments where `transformers` has Qwen3-VL but not yet Qwen3.5 (or vice versa). Consider checking only the architecture actually being loaded (e.g., based on `self.hf_config.architectures[0]`), or making the Qwen3.5 import optional unless a Qwen3.5 arch is detected.
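One way to implement the suggestion is a per-architecture check. The sketch below is hypothetical (the function name, mapping, and error message are illustrative, not lmdeploy code); it verifies only the classes the loaded architecture actually needs.

```python
import importlib

# Map each HF architecture to the transformers classes it requires
# (illustrative mapping; extend as new families are supported).
_ARCH_REQUIRED = {
    'Qwen3VLForConditionalGeneration': ('Qwen3VLForConditionalGeneration',),
    'Qwen3VLMoeForConditionalGeneration': ('Qwen3VLMoeForConditionalGeneration',),
    'Qwen3_5ForConditionalGeneration': ('Qwen3_5ForConditionalGeneration',),
    'Qwen3_5MoeForConditionalGeneration': ('Qwen3_5MoeForConditionalGeneration',),
}


def check_arch_deps(arch: str, module_name: str = 'transformers') -> None:
    """Raise ImportError only when classes needed by `arch` are unavailable."""
    required = _ARCH_REQUIRED.get(arch, ())
    if not required:
        return  # unknown arch: nothing to verify here
    module = importlib.import_module(module_name)
    missing = [name for name in required if not hasattr(module, name)]
    if missing:
        raise ImportError(f'transformers lacks {missing}; upgrade to a '
                          f'version that supports {arch}')
```

With this shape, a `transformers` release that ships Qwen3-VL but not Qwen3.5 still loads Qwen3-VL models.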
```python
with patch('lmdeploy.pytorch.engine.engine.response_reqs', side_effect=capture_response):
    engine._on_add_message([req])

assert len(captured) == 1
assert captured[0][0] == ResponseType.INTERNAL_ENGINE_ERROR
assert 'disable_vision_encoder=True' in captured[0][1]
```
This test expects `disable_vision_encoder=True` to reject multimodal inputs with an error, but the current Engine implementation (`lmdeploy/pytorch/engine/engine.py:_on_add_message`) sets `input_multimodals=None` and continues (warning only). Either update the engine behavior to match this test, or adjust the assertions to reflect the current contract; otherwise this test will fail consistently.
```python
if backend_config and backend_config.disable_vision_encoder:
    return 'llm', AsyncEngine
```
When `disable_vision_encoder` is set, `get_task()` routes VL architectures to the plain `AsyncEngine`. `AsyncEngine` constructs `MultimodalProcessor` without a `vl_encoder`, and `MultimodalProcessor.get_prompt_input()` will then treat multimodal messages as text-only and silently drop image/video blocks (it only joins `type == 'text'`). If the intended contract is to reject multimodal inputs when vision is disabled, this needs an explicit error path (or a different engine selection) to avoid silent data loss.
```diff
-if backend_config and backend_config.disable_vision_encoder:
-    return 'llm', AsyncEngine
+if backend_config and getattr(backend_config, 'disable_vision_encoder', False):
+    raise ValueError(
+        'Invalid configuration: disable_vision_encoder is True for a vision-language '
+        'model. This would route the model to a text-only engine and silently drop '
+        'image/video inputs. Please use a pure language model or enable the vision '
+        'encoder.')
```
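The silent loss described in this comment can be reproduced with a minimal stand-in. This is a simplified illustration, not lmdeploy's actual `MultimodalProcessor.get_prompt_input()` code: a text-only prompt builder keeps `type == 'text'` blocks and drops everything else without any error.

```python
def join_text_only(messages):
    """Build a prompt by keeping only type == 'text' content blocks."""
    parts = []
    for msg in messages:
        content = msg['content']
        if isinstance(content, str):
            parts.append(content)
            continue
        for block in content:
            if block.get('type') == 'text':
                parts.append(block['text'])  # image/video blocks fall through
    return '\n'.join(parts)


messages = [{
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Describe this image:'},
        {'type': 'image_url', 'image_url': {'url': 'cat.png'}},
    ],
}]
prompt = join_text_only(messages)
# The image block vanished with no error or warning.
```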
Add TurboMind vision support for the Qwen3-VL and Qwen3.5 VL vision encoders.
PR Testing
Scope
This change was validated on the TurboMind path only. The PyTorch backend was not modified or tested.
Regression Test
Ran the TurboMind-focused regression test:
Result:
Optional Qwen3.5 Fast Path Dependency
Installed the matching prebuilt `causal-conv1d` wheel for this environment:

```shell
python -m pip install --break-system-packages \
    'https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.6.1.post4/causal_conv1d-1.6.1%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl'
```

Verified that the optional Qwen3.5 fast path is available after install:
End-to-End TurboMind Inference
Validated image inference with the smallest Qwen3-VL model:
Observed:
Validated image inference with the smallest Qwen3.5 VL model:
Observed: