feat: implement Turbomind vision encoder support for Qwen3VL/3.5 families#4460

Open

lapy wants to merge 1 commit into InternLM:main from lapy:split/qwen-vision-backend

Conversation

@lapy (Contributor) commented Mar 24, 2026

Add TurboMind vision support for the Qwen3-VL and Qwen3.5 vision encoders.

PR Testing

Scope

This change was validated on the TurboMind path only. The PyTorch backend was not modified or tested.

Regression Test

Ran the TurboMind-focused regression test:

```shell
PYTHONPATH=/root/lmdeploy-prs/split-qwen-vision-backend \
pytest -q tests/test_lmdeploy/test_vl/test_qwen_vl_family.py
```

Result:

8 passed

Optional Qwen3.5 Fast Path Dependency

Installed the matching prebuilt causal-conv1d wheel for this environment:

```shell
python -m pip install --break-system-packages \
  'https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.6.1.post4/causal_conv1d-1.6.1%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl'
```

Verified the optional Qwen3.5 fast path is available after install:

```shell
python - <<'PY'
from transformers.models.qwen3_5 import modeling_qwen3_5 as m
print('is_fast_path_available', m.is_fast_path_available)
print('causal_conv1d_fn', m.causal_conv1d_fn is not None)
print('causal_conv1d_update', m.causal_conv1d_update is not None)
print('chunk_gated_delta_rule', m.chunk_gated_delta_rule is not None)
print('fused_recurrent_gated_delta_rule', m.fused_recurrent_gated_delta_rule is not None)
PY
```

End-to-End TurboMind Inference

Validated image inference with the smallest Qwen3-VL model:

```shell
CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=/root/lmdeploy-prs/split-qwen-vision-backend:/root/lmdeploy/lmdeploy/lib \
TM_LOG_LEVEL=ERROR \
python - <<'PY'
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'Qwen/Qwen3-VL-2B-Instruct'
backend_config = TurbomindEngineConfig(
    tp=1,
    max_batch_size=1,
    session_len=4096,
    cache_max_entry_count=0.05,
)
gen_config = GenerationConfig(max_new_tokens=48, do_sample=False, temperature=0.0)

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=backend_config, log_level='ERROR')
print('backend', pipe.async_engine.backend)
out = pipe(('Describe the image in one sentence.', img), gen_config=gen_config)
print(out.text)
pipe.close()
PY
```

Observed:

backend turbomind
A majestic tiger with a striking orange coat and black stripes rests peacefully on a vibrant green lawn, its gaze fixed directly at the camera.

Validated image inference with the smallest Qwen3.5 VL model:

```shell
CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=/root/lmdeploy-prs/split-qwen-vision-backend:/root/lmdeploy/lmdeploy/lib \
TM_LOG_LEVEL=ERROR \
python - <<'PY'
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'Qwen/Qwen3.5-0.8B'
backend_config = TurbomindEngineConfig(
    tp=1,
    max_batch_size=1,
    session_len=4096,
    cache_max_entry_count=0.05,
)
gen_config = GenerationConfig(max_new_tokens=48, do_sample=False, temperature=0.0)

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=backend_config, log_level='ERROR')
print('backend', pipe.async_engine.backend)
out = pipe(('Describe the image in one sentence.', img), gen_config=gen_config)
print(out.text)
pipe.close()
PY
```

Observed:

backend turbomind
A majestic tiger lies peacefully on a sunlit grassy field, its powerful eyes fixed forward and its powerful body relaxed.

@lapy lapy changed the title feat: implement Turbomind vision encoder support for Qwen3VL/3.5 fami… feat: implement Turbomind vision encoder support for Qwen3VL/3.5 families Mar 24, 2026
@lapy lapy force-pushed the split/qwen-vision-backend branch from b33b54e to 539e55c Compare March 24, 2026 22:11
@lapy lapy force-pushed the split/qwen-vision-backend branch from f177f41 to 13f0ae2 Compare March 24, 2026 23:14
@lvhan028 (Collaborator) commented:

@lapy you are on fire!

Copilot AI left a comment

Pull request overview

This PR adds TurboMind vision-encoder support for the Qwen3-VL and Qwen3.5 VL families by implementing split-vision loading/forwarding for TurboMind, updating model/arch routing, and extending export support for VL checkpoint key layouts.

Changes:

  • Implement TurboMind split-vision loading, vision forward, and mRoPE metadata packing for Qwen3-VL family (and reuse for Qwen3.5).
  • Update task routing (get_task) and call sites to use backend-config-aware VL/LLM engine selection, including a new disable_vision_encoder flag on TurbomindEngineConfig.
  • Extend TurboMind model/export support to recognize Qwen3-VL architectures and VL checkpoints whose text weights live under model.language_model.*, plus add regression tests.
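To make the new routing behavior concrete, here is a minimal, self-contained sketch of the backend-config-aware selection described above. The names mirror the PR description (`get_task`, `disable_vision_encoder`, `TurbomindEngineConfig`), but the bodies are simplified stand-ins, not the actual lmdeploy implementation, and the exact semantics (text-only routing vs. a hard error) are still under review below.

```python
from dataclasses import dataclass

@dataclass
class TurbomindEngineConfig:
    # Simplified stand-in for lmdeploy.messages.TurbomindEngineConfig;
    # the PR adds the disable_vision_encoder flag.
    tp: int = 1
    disable_vision_encoder: bool = False

def get_task(is_vl_model: bool, backend_config=None) -> str:
    """Backend-config-aware VL/LLM selection, per the PR description.

    When disable_vision_encoder is set, a VL model is routed to the
    plain LLM engine instead of the VL pipeline.
    """
    if backend_config and getattr(backend_config, 'disable_vision_encoder', False):
        return 'llm'
    return 'vlm' if is_vl_model else 'llm'
```

For example, a Qwen3-VL checkpoint with `disable_vision_encoder=True` would be served by the text-only engine, while the default config keeps the vision path.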

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| lmdeploy/vl/model/qwen3.py | Adds dependency checks, HF arch→AutoModel mapping, split-vision loader, TurboMind forward/to_turbomind with mRoPE meta. |
| lmdeploy/vl/model/qwen3_5.py | Reuses Qwen3-VL TurboMind vision path for Qwen3.5 and aligns preprocessor dep checks. |
| lmdeploy/archs.py | Changes get_task signature and adds disable_vision_encoder routing behavior. |
| lmdeploy/pipeline.py | Updates get_task call to new signature. |
| lmdeploy/serve/openai/api_server.py | Updates get_task call to new signature. |
| lmdeploy/lite/apis/calibrate.py | Updates get_task call to new signature. |
| lmdeploy/messages.py | Adds disable_vision_encoder to TurbomindEngineConfig. |
| lmdeploy/cli/serve.py | Wires CLI flag through to TurbomindEngineConfig. |
| lmdeploy/turbomind/supported_models.py | Marks Qwen3-VL architectures as TurboMind-supported. |
| lmdeploy/turbomind/deploy/source_model/qwen.py | Supports nested model.language_model.* prefixes and layer-pattern matching for VL checkpoints. |
| tests/test_lmdeploy/test_vl/test_qwen_vl_family.py | Adds unit tests for arch resolution, task routing, and Qwen3-VL TurboMind packing behavior. |
| tests/test_lmdeploy/test_pytorch/test_engine_disable_vision.py | Adds a regression test for expected behavior when vision is disabled in the PyTorch engine. |


Comment on lines 25 to +31

```diff
 try:
-    from transformers import Qwen3VLForConditionalGeneration, Qwen3VLMoeForConditionalGeneration  # noqa: F401
+    from transformers import (  # noqa: F401
+        Qwen3_5ForConditionalGeneration,
+        Qwen3_5MoeForConditionalGeneration,
+        Qwen3VLForConditionalGeneration,
+        Qwen3VLMoeForConditionalGeneration,
+    )
```
Copilot AI commented Mar 27, 2026
check_qwen3_vl_deps_install() currently requires both Qwen3-VL and Qwen3.5 classes to be importable from transformers. This can regress Qwen3-VL usage in environments where transformers has Qwen3-VL but not yet Qwen3.5 (or vice-versa). Consider checking only the architecture actually being loaded (e.g., based on self.hf_config.architectures[0]), or making the Qwen3.5 import optional unless a Qwen3.5 arch is detected.

Comment on lines +40 to +45

```python
with patch('lmdeploy.pytorch.engine.engine.response_reqs', side_effect=capture_response):
    engine._on_add_message([req])

assert len(captured) == 1
assert captured[0][0] == ResponseType.INTERNAL_ENGINE_ERROR
assert 'disable_vision_encoder=True' in captured[0][1]
```
Copilot AI commented Mar 27, 2026
This test expects disable_vision_encoder=True to reject multimodal inputs with an error, but the current Engine implementation (lmdeploy/pytorch/engine/engine.py:_on_add_message) sets input_multimodals=None and continues (warning only). Either update the engine behavior to match this test, or adjust the assertions to reflect the current contract; otherwise this test will fail consistently.
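A sketch of the engine-side contract the test assumes, for contrast with the current warn-and-continue behavior: reject multimodal requests with an error response when the vision encoder is disabled. `ResponseType` is a stand-in enum and `on_add_message` a simplified hypothetical, not the actual `Engine._on_add_message`.

```python
from enum import Enum, auto

class ResponseType(Enum):
    # Stand-in for lmdeploy.pytorch's response-type enum.
    SUCCESS = auto()
    INTERNAL_ENGINE_ERROR = auto()

def on_add_message(req: dict, disable_vision_encoder: bool, respond) -> bool:
    """Reject multimodal input up front instead of silently dropping it.

    Returns False (and emits an error response) when the request carries
    multimodal data but the vision encoder is disabled; True otherwise.
    """
    if disable_vision_encoder and req.get('input_multimodals'):
        respond(ResponseType.INTERNAL_ENGINE_ERROR,
                'multimodal input rejected: disable_vision_encoder=True')
        return False
    return True
```

Under this contract the regression test's assertions (one captured error, `INTERNAL_ENGINE_ERROR`, message mentioning the flag) would pass; the alternative is to relax the test to match the current warn-and-continue behavior.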

Comment on lines +138 to +139

```python
if backend_config and backend_config.disable_vision_encoder:
    return 'llm', AsyncEngine
```
Copilot AI commented Mar 27, 2026
When disable_vision_encoder is set, get_task() routes VL architectures to the plain AsyncEngine. AsyncEngine constructs MultimodalProcessor without a vl_encoder, and MultimodalProcessor.get_prompt_input() will then treat multimodal messages as text-only and silently drop image/video blocks (it only joins type=='text'). If the intended contract is to reject multimodal inputs when vision is disabled, this needs an explicit error path (or a different engine selection) to avoid silent data loss.

Suggested change

```diff
-if backend_config and backend_config.disable_vision_encoder:
-    return 'llm', AsyncEngine
+if backend_config and getattr(backend_config, 'disable_vision_encoder', False):
+    raise ValueError(
+        'Invalid configuration: disable_vision_encoder is True for a vision-language '
+        'model. This would route the model to a text-only engine and silently drop '
+        'image/video inputs. Please use a pure language model or enable the vision '
+        'encoder.'
+    )
```


Labels: enhancement (New feature or request)

3 participants