[Qwen3.5] dedup position ids #2102
Pull request overview
This PR reduces ONNX graph duplication in the Qwen3.5 builder by centralizing synthetic position_ids generation and deduplicating the resulting subgraph across all layers/attention paths, helping shrink graphs and avoid repeated INT64 ops.
Changes:
- Extracted `_make_synthetic_position_ids()` to derive `batch_size`/`sequence_length` from the `position_ids` model input shape `[3, B, S]`.
- Switched mRoPE rotation to reuse a shared synthetic `position_ids` subgraph via a fixed basename (leveraging name-based node dedup).
- Removed per-Q/K (and per-layer) synthetic `position_ids` subgraph construction from `_apply_mrope_rotation()`.
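The name-based dedup that the fixed basename relies on can be sketched roughly as follows. This is only an illustration: `TinyGraphBuilder`, `build`, the node names, and the two placeholder ops are hypothetical stand-ins, not the real builder API or the real subgraph.

```python
class TinyGraphBuilder:
    """Hypothetical stand-in for the model builder (assumption, not the real class)."""

    def __init__(self):
        self.nodes = {}  # node name -> (op_type, inputs, outputs)

    def make_node(self, op_type, inputs, outputs, name):
        # Name-based dedup: a node whose name already exists is not added again.
        if name in self.nodes:
            return
        self.nodes[name] = (op_type, tuple(inputs), tuple(outputs))

    def make_synthetic_position_ids(self, basename):
        # Placeholder two-node subgraph; the real subgraph contains other ops.
        shape_name = f"{basename}/Shape"
        self.make_node("Shape", ["position_ids"], [f"{shape_name}/output_0"], name=shape_name)
        reshape_name = f"{basename}/Reshape"
        self.make_node(
            "Reshape",
            ["position_ids", f"{basename}/target_shape"],
            [f"{reshape_name}/output_0"],
            name=reshape_name,
        )
        return f"{reshape_name}/output_0"


def build(num_layers, shared):
    """Request the subgraph once per layer and per Q/K call."""
    builder = TinyGraphBuilder()
    for layer_id in range(num_layers):
        for proj in ("q", "k"):
            if shared:
                # Fixed basename: every call asks for the same node names.
                basename = "/model/synthetic_pos_ids"
            else:
                # Per-call basename: every call creates its own copy.
                basename = f"/model/layers.{layer_id}/{proj}/synthetic_pos_ids"
            builder.make_synthetic_position_ids(basename)
    return builder


shared = build(4, shared=True)
per_call = build(4, shared=False)
print(len(shared.nodes), len(per_call.nodes))  # 2 16
```

With a fixed basename the node count stays constant as layers are added, while per-call names grow linearly with the number of layers and Q/K calls.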
cd1a67c to c25aff1
Performance Comparison - Prefill-1024, Decode-128
There are some styling changes that can be made.
LGTM after Kunal's and Copilot's changes are done.
c25aff1 to e98cb96
Thanks.
When I run |
Extract `_make_synthetic_position_ids()` to derive `batch_size` and `sequence_length` from the `position_ids` model input `[3, B, S]` instead of running Shape on intermediate Q/K tensors. Use a fixed basename so that `make_node` dedup creates the subgraph once and reuses it across all layers and Q/K calls.
- Reduces ONNX graph size by eliminating per-layer duplicates.
- Eliminates the INT64 Mul by using `Reshape([3, -1])` to infer `B*S` implicitly, avoiding CPU fallback on WebGPU (WebGPU Mul does not support INT64).
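The `Reshape([3, -1])` point can be illustrated with a small pure-Python sketch of how a single `-1` in a Reshape target is resolved from the element count. `resolve_reshape` is a hypothetical helper written for this example, it ignores the ONNX special case where `0` copies an input dimension, and it assumes the reshaped tensor has `3 * B * S` elements.

```python
from math import prod


def resolve_reshape(input_shape, target_shape):
    """Resolve at most one -1 in target_shape from the total element count."""
    if target_shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    total = prod(input_shape)
    known = prod(d for d in target_shape if d != -1)
    if -1 not in target_shape:
        if known != total:
            raise ValueError("element count mismatch")
        return list(target_shape)
    if known == 0 or total % known != 0:
        raise ValueError("cannot infer the -1 dimension")
    return [total // known if d == -1 else d for d in target_shape]


B, S = 2, 5

# Explicit route: the target shape needs B*S, which in a graph means an
# INT64 Mul on values taken from a Shape output.
explicit_target = [3, B * S]

# Implicit route: the target shape is the constant [3, -1]; the runtime
# infers the second dimension, so no Mul node is needed.
implicit_target = resolve_reshape([3, B, S], [3, -1])

print(explicit_target == implicit_target)  # True
```

Because the target shape becomes a constant, it no longer depends on any arithmetic over INT64 shape values.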
e98cb96 to c298862
OK. Rebasing against main cleared the lint errors.
@kunal-vaishnavi
Extract `_make_synthetic_position_ids()` to derive `batch_size` and `sequence_length` from the `position_ids` model input `[3, B, S]` instead of running Shape on intermediate Q/K tensors. Use a fixed basename so that `make_node` dedup creates the subgraph once and reuses it across all layers and Q/K calls.
- Reduces ONNX graph size by eliminating per-layer duplicates.
- Eliminates redundant INT64 Mul nodes that cause CPU fallback on WebGPU (WebGPU Mul does not support INT64).