```diff
@@ -389,11 +385,11 @@ def get_moe_group_name(group):
     q_seqlens=q_seqlens_cpu,
     # kv_seqlens_expanded is only expanded in paged prefill,
     # otherwise it equals kv_seqlens_cpu
```
remove these unused comments
```diff
-    return q_seqlens_cpu
+    return q_seqlens_cpu.cumsum(dim=0)
```
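The `cumsum` change above converts per-sequence query lengths into cumulative offsets (often called `cu_seqlens`), the layout many varlen attention kernels expect for locating each sequence in a packed token dimension. A minimal sketch with PyTorch, using illustrative batch values rather than lmdeploy's real tensors:

```python
import torch

# Hypothetical per-sequence query lengths for a batch of 3 requests.
q_seqlens_cpu = torch.tensor([4, 2, 3], dtype=torch.int64)

# Cumulative offsets: entry i marks where sequence i ends in the
# packed (varlen) token dimension, so the kernel can slice per request.
cu_seqlens = q_seqlens_cpu.cumsum(dim=0)
print(cu_seqlens.tolist())  # [4, 6, 9]
```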
```python
def get_kv_start_indices_and_attention_mask(is_decoding, is_prefill_no_cache, q_seqlens_list, kv_seqlens_list,
```
q_seqlens_list, kv_seqlens_list are not used
Pull request overview
This PR updates DLInfer paged-attention plumbing to fix prefix caching behavior (notably for Ascend), primarily by renaming the “unpaged prefill” flag to a clearer “prefill w/o cache” concept and by passing q_seqlens into the token-attention kernel path.
Changes:
- Rename `is_unpaged_prefill` → `is_prefill_no_cache` across DLInfer attention metadata and backends.
- Pass `q_seqlens` through the paged token-attention wrapper into the underlying ext op.
- Adjust Ascend backend sequence-length/mask preparation logic used for prefix/paged prefill.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lmdeploy/pytorch/kernels/dlinfer/pagedattention.py | Renames prefill flag and threads q_seqlens into token-attention execution. |
| lmdeploy/pytorch/backends/dlinfer/maca/op_backend.py | Updates metadata flag name for prefill/caching behavior. |
| lmdeploy/pytorch/backends/dlinfer/camb/op_backend.py | Updates metadata flag name and related conditional logic. |
| lmdeploy/pytorch/backends/dlinfer/attention.py | Renames metadata field and forwards it into the kernel call. |
| lmdeploy/pytorch/backends/dlinfer/ascend/op_backend.py | Reworks Ascend prefill/prefix-caching preparation (seqlens + attention mask), and updates flag naming. |
```python
torch.triu(torch.ones(2048, 2048, dtype=torch.bool, device=step_context.block_offsets.device),
           diagonal=1))
```
The attention mask is now hard-coded to a 2048x2048 tensor, which can produce incorrect masking (or shape mismatches) whenever max_q_seq_len / max_kv_seq_len differ from 2048. This should be derived from the actual computed lengths (e.g., max_q_seq_len, max_kv_seq_len, or step_context.block_offsets.shape[1] * block_size) so the mask matches the kernel’s expected dimensions.
Suggested change:

```diff
-                torch.triu(torch.ones(2048, 2048, dtype=torch.bool, device=step_context.block_offsets.device),
-                           diagonal=1))
+                torch.triu(
+                    torch.ones(max_q_seq_len,
+                               max_kv_seq_len,
+                               dtype=torch.bool,
+                               device=step_context.block_offsets.device),
+                    diagonal=1))
```
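The suggested fix sizes the causal mask from the computed lengths instead of a fixed 2048. A minimal sketch of the resulting mask, assuming PyTorch and small illustrative lengths (the `device=` argument is omitted here so the snippet runs on CPU):

```python
import torch

# Illustrative lengths; in the backend these come from the current batch
# (max_q_seq_len, max_kv_seq_len).
max_q_seq_len, max_kv_seq_len = 3, 5

# True above the diagonal marks positions to mask out; diagonal=1 keeps
# each query position's own token visible.
mask = torch.triu(torch.ones(max_q_seq_len, max_kv_seq_len, dtype=torch.bool),
                  diagonal=1)
```

Note that when `max_kv_seq_len` exceeds `max_q_seq_len` (the prefix-caching case), the diagonal offset may also need shifting (e.g. `diagonal = max_kv_seq_len - max_q_seq_len + 1`) so the last query row can attend to all cached tokens; the suggestion above keeps `diagonal=1` as in the original code.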
```python
# kv_seqlens_expanded is only expanded in paged prefill,
# otherwise it equals kv_seqlens_cpu
```
The comment still refers to kv_seqlens_expanded, but the code no longer computes or passes an expanded KV seqlen tensor. Please update/remove this comment to reflect the new behavior so future changes don’t rely on outdated assumptions.
Suggested change:

```diff
-    # kv_seqlens_expanded is only expanded in paged prefill,
-    # otherwise it equals kv_seqlens_cpu
+    # kv_seqlens is provided as kv_seqlens_cpu
```
No description provided.