
[ascend] fix prefix caching #4448

Draft

yao-fengchen wants to merge 3 commits into InternLM:main from DeepLink-org:prefix_caching

Conversation

@yao-fengchen
Collaborator

No description provided.

@@ -389,11 +385,11 @@ def get_moe_group_name(group):
q_seqlens=q_seqlens_cpu,
# kv_seqlens_expanded is only expanded in paged prefill,
# otherwise it equals kv_seqlens_cpu
Collaborator

remove these unused comments

-    return q_seqlens_cpu
+    return q_seqlens_cpu.cumsum(dim=0)
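For context on the return-value change above, the switch from raw per-request lengths to cumulative sums can be sketched in plain Python; the values are illustrative, and `itertools.accumulate` stands in for `torch.Tensor.cumsum`:

```python
from itertools import accumulate

# Hypothetical per-request query lengths (not taken from the PR).
q_seqlens = [3, 5, 2]

# Old behavior: return the raw per-request lengths as-is.
raw = q_seqlens

# New behavior, mirroring q_seqlens_cpu.cumsum(dim=0): running offsets
# marking where each request's tokens end in the packed batch, which is
# the boundary format many attention kernels expect.
cumulative = list(accumulate(q_seqlens))

print(cumulative)  # [3, 8, 10]
```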

def get_kv_start_indices_and_attention_mask(is_decoding, is_prefill_no_cache, q_seqlens_list, kv_seqlens_list,
Collaborator

q_seqlens_list, kv_seqlens_list are not used

Contributor

Copilot AI left a comment

Pull request overview

This PR updates DLInfer paged-attention plumbing to fix prefix caching behavior (notably for Ascend), primarily by renaming the “unpaged prefill” flag to a clearer “prefill w/o cache” concept and by passing q_seqlens into the token-attention kernel path.

Changes:

  • Rename is_unpaged_prefill → is_prefill_no_cache across DLInfer attention metadata and backends.
  • Pass q_seqlens through the paged token-attention wrapper into the underlying ext op.
  • Adjust Ascend backend sequence-length/mask preparation logic used for prefix/paged prefill.
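The rename in the first bullet can be sketched as a simplified metadata container. Only the flag names come from the diff; the class name and surrounding fields here are hypothetical simplifications:

```python
from dataclasses import dataclass

# Minimal sketch of the renamed attention-metadata flag. Only
# `is_prefill_no_cache` (formerly `is_unpaged_prefill`) is from the PR;
# everything else is an illustrative stand-in.
@dataclass
class DlinferAttentionMetadataSketch:
    is_decoding: bool
    # Renamed to make the "prefill without a populated KV cache"
    # case explicit, rather than the vaguer "unpaged prefill".
    is_prefill_no_cache: bool = False

meta = DlinferAttentionMetadataSketch(is_decoding=False, is_prefill_no_cache=True)
print(meta.is_prefill_no_cache)  # True
```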

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

  • lmdeploy/pytorch/kernels/dlinfer/pagedattention.py: Renames prefill flag and threads q_seqlens into token-attention execution.
  • lmdeploy/pytorch/backends/dlinfer/maca/op_backend.py: Updates metadata flag name for prefill/caching behavior.
  • lmdeploy/pytorch/backends/dlinfer/camb/op_backend.py: Updates metadata flag name and related conditional logic.
  • lmdeploy/pytorch/backends/dlinfer/attention.py: Renames metadata field and forwards it into the kernel call.
  • lmdeploy/pytorch/backends/dlinfer/ascend/op_backend.py: Reworks Ascend prefill/prefix-caching preparation (seqlens and attention mask) and updates flag naming.


Comment on lines +252 to +253
torch.triu(torch.ones(2048, 2048, dtype=torch.bool, device=step_context.block_offsets.device),
diagonal=1))

Copilot AI Apr 2, 2026


The attention mask is now hard-coded to a 2048x2048 tensor, which can produce incorrect masking (or shape mismatches) whenever max_q_seq_len / max_kv_seq_len differ from 2048. This should be derived from the actual computed lengths (e.g., max_q_seq_len, max_kv_seq_len, or step_context.block_offsets.shape[1] * block_size) so the mask matches the kernel’s expected dimensions.

Suggested change
torch.triu(torch.ones(2048, 2048, dtype=torch.bool, device=step_context.block_offsets.device),
diagonal=1))
torch.triu(
torch.ones(max_q_seq_len,
max_kv_seq_len,
dtype=torch.bool,
device=step_context.block_offsets.device),
diagonal=1))
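The suggested fix keeps the causal-mask semantics of `torch.triu(torch.ones(q, k), diagonal=1)`: entries strictly above the main diagonal are True, i.e. the positions a causal mask blocks. A dependency-free sketch of that shape logic, with hypothetical sizes, shows why the mask dimensions must track the actual sequence lengths:

```python
# Pure-Python model of torch.triu(torch.ones(q, k, dtype=bool), diagonal=1):
# entry (q, kv) is True when kv > q, marking positions a causal mask
# blocks. Deriving the dimensions from the real max_q_seq_len /
# max_kv_seq_len (instead of a fixed 2048) keeps the mask shape in step
# with the kernel's inputs.
def causal_block_mask(max_q_seq_len: int, max_kv_seq_len: int):
    return [[kv > q for kv in range(max_kv_seq_len)]
            for q in range(max_q_seq_len)]

mask = causal_block_mask(3, 4)
# Row q may attend to kv positions 0..q; True marks masked-out entries:
# [[False, True,  True,  True ],
#  [False, False, True,  True ],
#  [False, False, False, True ]]
```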

Comment on lines 386 to 387
# kv_seqlens_expanded is only expanded in paged prefill,
# otherwise it equals kv_seqlens_cpu

Copilot AI Apr 2, 2026


The comment still refers to kv_seqlens_expanded, but the code no longer computes or passes an expanded KV seqlen tensor. Please update/remove this comment to reflect the new behavior so future changes don’t rely on outdated assumptions.

Suggested change
# kv_seqlens_expanded is only expanded in paged prefill,
# otherwise it equals kv_seqlens_cpu
# kv_seqlens is provided as kv_seqlens_cpu


3 participants