```diff
@@ -389,11 +385,11 @@ def get_moe_group_name(group):
     q_seqlens=q_seqlens_cpu,
     # kv_seqlens_expanded is only expanded in paged prefill,
     # otherwise it equals kv_seqlens_cpu
```
remove these unused comments
```diff
-    return q_seqlens_cpu
+    return q_seqlens_cpu.cumsum(dim=0)
```
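The `cumsum` change above converts per-sequence query lengths into cumulative offsets (often called `cu_seqlens`), the layout many varlen attention kernels expect for locating each sequence in a packed token dimension. A minimal sketch with PyTorch, using illustrative batch values rather than lmdeploy's real tensors:

```python
import torch

# Hypothetical per-sequence query lengths for a batch of 3 requests.
q_seqlens_cpu = torch.tensor([4, 2, 3], dtype=torch.int64)

# Cumulative offsets: entry i marks where sequence i ends in the
# packed (varlen) token dimension, so the kernel can slice per request.
cu_seqlens = q_seqlens_cpu.cumsum(dim=0)
print(cu_seqlens.tolist())  # [4, 6, 9]
```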
```python
def get_kv_start_indices_and_attention_mask(is_decoding, is_prefill_no_cache, q_seqlens_list, kv_seqlens_list,
```
q_seqlens_list, kv_seqlens_list are not used
Pull request overview
This PR updates DLInfer paged-attention plumbing to fix prefix caching behavior (notably for Ascend), primarily by renaming the “unpaged prefill” flag to a clearer “prefill w/o cache” concept and by passing q_seqlens into the token-attention kernel path.
Changes:
- Rename `is_unpaged_prefill` → `is_prefill_no_cache` across DLInfer attention metadata and backends.
- Pass `q_seqlens` through the paged token-attention wrapper into the underlying ext op.
- Adjust Ascend backend sequence-length/mask preparation logic used for prefix/paged prefill.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lmdeploy/pytorch/kernels/dlinfer/pagedattention.py | Renames prefill flag and threads q_seqlens into token-attention execution. |
| lmdeploy/pytorch/backends/dlinfer/maca/op_backend.py | Updates metadata flag name for prefill/caching behavior. |
| lmdeploy/pytorch/backends/dlinfer/camb/op_backend.py | Updates metadata flag name and related conditional logic. |
| lmdeploy/pytorch/backends/dlinfer/attention.py | Renames metadata field and forwards it into the kernel call. |
| lmdeploy/pytorch/backends/dlinfer/ascend/op_backend.py | Reworks Ascend prefill/prefix-caching preparation (seqlens + attention mask), and updates flag naming. |
```python
torch.triu(torch.ones(2048, 2048, dtype=torch.bool, device=step_context.block_offsets.device),
           diagonal=1))
```
The attention mask is now hard-coded to a 2048x2048 tensor, which can produce incorrect masking (or shape mismatches) whenever max_q_seq_len / max_kv_seq_len differ from 2048. This should be derived from the actual computed lengths (e.g., max_q_seq_len, max_kv_seq_len, or step_context.block_offsets.shape[1] * block_size) so the mask matches the kernel’s expected dimensions.
Suggested change:

```diff
-                torch.triu(torch.ones(2048, 2048, dtype=torch.bool, device=step_context.block_offsets.device),
-                           diagonal=1))
+                torch.triu(
+                    torch.ones(max_q_seq_len,
+                               max_kv_seq_len,
+                               dtype=torch.bool,
+                               device=step_context.block_offsets.device),
+                    diagonal=1))
```
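The suggested fix sizes the causal mask from the computed lengths instead of a fixed 2048. A minimal sketch of the resulting mask, assuming PyTorch and small illustrative lengths (the `device=` argument is omitted here so the snippet runs on CPU):

```python
import torch

# Illustrative lengths; in the backend these come from the current batch
# (max_q_seq_len, max_kv_seq_len).
max_q_seq_len, max_kv_seq_len = 3, 5

# True above the diagonal marks positions to mask out; diagonal=1 keeps
# each query position's own token visible.
mask = torch.triu(torch.ones(max_q_seq_len, max_kv_seq_len, dtype=torch.bool),
                  diagonal=1)
```

Note that when `max_kv_seq_len` exceeds `max_q_seq_len` (the prefix-caching case), the diagonal offset may also need shifting (e.g. `diagonal = max_kv_seq_len - max_q_seq_len + 1`) so the last query row can attend to all cached tokens; the suggestion above keeps `diagonal=1` as in the original code.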
```python
# kv_seqlens_expanded is only expanded in paged prefill,
# otherwise it equals kv_seqlens_cpu
```
The comment still refers to kv_seqlens_expanded, but the code no longer computes or passes an expanded KV seqlen tensor. Please update/remove this comment to reflect the new behavior so future changes don’t rely on outdated assumptions.
Suggested change:

```diff
-    # kv_seqlens_expanded is only expanded in paged prefill,
-    # otherwise it equals kv_seqlens_cpu
+    # kv_seqlens is provided as kv_seqlens_cpu
```
No description provided.