
[WIP]: fix#4076

Closed
asddongmen wants to merge 11 commits into pingcap:master from asddongmen:0128-investiate-puller-stuck

Conversation


@asddongmen asddongmen commented Jan 28, 2026

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Chores

    • More efficient resolved-timestamp computation and caching of minimum timestamps for unlocked ranges.
    • Added optional per-table testing/instrumentation to simulate locking behavior and gather periodic stats.
  • Tests

    • Updated tests for the new per-table behavior and added benchmarks to compare timestamp/heap approaches across region mixes.


Signed-off-by: dongmen <414110582@qq.com>
@ti-chi-bot ti-chi-bot bot added labels on Jan 28, 2026: do-not-merge/needs-linked-issue; do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress); release-note (denotes a PR that will be considered when it comes time to generate release notes).
@asddongmen (Collaborator, Author)

/test all

@ti-chi-bot ti-chi-bot bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Jan 28, 2026.
@gemini-code-assist

Summary of Changes

Hello @asddongmen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request focuses on enhancing the performance of the RangeLock mechanism within the logpuller service by optimizing how the minimum resolved timestamp is calculated. The changes introduce a caching strategy for unlocked ranges and utilize a min-heap for locked ranges, drastically reducing the computational overhead associated with frequent queries for the global minimum resolved timestamp. These improvements are critical for maintaining the efficiency and responsiveness of the log service.

Highlights

  • Performance Optimization for Resolved Timestamp Calculation: The ResolvedTs method within the RangeLock component has been significantly optimized to improve efficiency in determining the minimum resolved timestamp across all ranges.
  • Caching of Unlocked Range Minimum Timestamp: A new field, unlockedRangesMinTs, was introduced to cache the minimum resolved timestamp of unlocked ranges, thereby eliminating the need for repeated, potentially expensive, recalculations.
  • Efficient Locked Range Minimum Timestamp Retrieval: The ResolvedTs method now leverages lockedRangeStateHeap.PeekTop() for O(1) retrieval of the minimum resolved timestamp from locked ranges, replacing a less efficient iteration over a btree.
  • Refactored ResolvedTs Update Logic: The logic in region_event_handler.go for updating the RangeLock's state and retrieving the resolved timestamp has been adjusted to utilize the newly optimized RangeLock.ResolvedTs() method.
  • New Performance Benchmarks: Dedicated benchmark tests (BenchmarkGetHeapMinTs and BenchmarkGetHeapMinTsVsResolvedTs) have been added to validate and compare the performance improvements of the new resolved timestamp calculation methods.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to optimize resolved timestamp calculation in the log puller using caching and a min-heap. While logservice/logpuller/regionlock/region_range_lock.go changes are sound, a critical high-severity logic error exists in logservice/logpuller/region_event_handler.go. The min-heap update logic was incorrectly placed within a throttled block, leading to potential heap inconsistency. This can cause the system to report an incorrect resolved timestamp, potentially resulting in silent data loss in CDC systems like TiCDC. It is crucial to ensure the heap is updated consistently whenever a region's resolved timestamp changes to maintain data integrity.

Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
@ti-chi-bot

ti-chi-bot bot commented Jan 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hongyunyan for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

Adds a cached minimum resolved timestamp for unlocked ranges, introduces a tableID parameter to RangeLock, implements a hack/testing mode with a lock cap and background stats loop, updates call sites and tests to pass tableID, and adds benchmarks and helpers; no core control-flow changes outside range lock internals.

Changes

Cohort / File(s) Summary
Range lock implementation
logservice/logpuller/regionlock/region_range_lock.go
Added unlockedRangesMinTs and hackLimitLock; NewRangeLock gains tableID param and initializes unlockedRangesMinTs; ResolvedTs() now combines heap top (PeekTop) and cached unlocked min; added hack-mode logic, logStatsLoop, and minor unlock/lock-path adjustments.
Range lock helpers
logservice/logpuller/regionlock/range_ts_map.go
Added hackMode flag, countRanges(), hackSplitInto(n, ts) and isHackMode() helpers to support testing/hack behavior and key-splitting.
Tests & benchmarks
logservice/logpuller/regionlock/region_range_lock_test.go
Updated constructor calls to include new param; added setupRangeLockForBenchmark, BenchmarkGetHeapMinTs, BenchmarkGetHeapMinTsVsResolvedTs; tests refresh heap via UpdateLockedRangeStateHeap(...).
Call sites / event handling
logservice/logpuller/region_event_handler.go, logservice/logpuller/region_event_handler_test.go, logservice/logpuller/subscription_client.go
Updated NewRangeLock call sites to pass tableID; handleResolvedTs() includes new comments about potential large-region performance impacts and uses span.rangeLock.ResolvedTs().
Manifest / module
manifest_file, go.mod
Supporting manifest and module updates included in the diff.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I peek the heap and tuck a min away,

A table tag for hops, a tiny play,
I split the ground for tests and count the day,
Stats hum soft while unlocked whispers stay.

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check — ⚠️ Warning. The PR description is incomplete: the template is filled only with placeholders (Issue Number: close #xxx, an empty "What is changed" section, unanswered questions, and no release note). Resolution: fill in the actual issue number, describe the changes to RangeLock and range_ts_map, answer the performance/compatibility questions, and provide a release note.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 21.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive. The title "[WIP]: fix" is too vague and generic; it conveys nothing about the actual changes (range lock optimization, unlockedRangesMinTs caching, hack mode for table 930). Resolution: replace it with a descriptive title summarizing the main change, e.g. "Add caching for unlocked range timestamps in RangeLock".


Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@logservice/logpuller/regionlock/region_range_lock.go`:
- Around line 491-496: The created buffered channel (ch) returned when
l.hackLimitLock is reached is never signaled/closed, causing callers of
LockRange's WaitFn to block; modify the hack path in region_range_lock.go (the
block that checks l.hackLimitLock and lockedRanges) to return a channel that is
closed (or otherwise signaled) so waiters unblock—e.g., create ch := make(chan
interface{},1); close(ch) before returning the []<-chan interface{} so
LockRangeResult{Status: LockRangeStatusWait} consumers and the WaitFn (the wait
logic used by LockRange) do not leak goroutines.
- Around line 138-140: Remove the WIP debug code: delete the hackLimitLock field
from the struct and remove all code paths that reference it (the hardcoded table
ID 930 checks, the logic that force-sets the limit to 2/30, and the goroutine
that logs every 30s). If this behavior must be preserved for testing, replace it
with a proper, documented configuration flag (e.g., DebugLimitLock or
LimitLockConfig) passed into RegionRangeLock via constructor/config instead of
hardcoded values; otherwise just eliminate the field and associated
branches/goroutine so no magic IDs or periodic test logging remain.
🧹 Nitpick comments (1)
logservice/logpuller/regionlock/region_range_lock.go (1)

307-320: Good optimization, but inconsistent with GetHeapMinTs().

The ResolvedTs() method now uses the cached unlockedRangesMinTs for O(1) lookup instead of scanning all unlocked ranges. However, GetHeapMinTs() (lines 470-472) still calls l.unlockedRanges.getMinTs() directly:

// GetHeapMinTs (line 470)
unlockedMinTs := l.unlockedRanges.getMinTs()

Consider updating GetHeapMinTs() to also use the cached value for consistency, or document why the difference is intentional.

Comment on lines +138 to +140

// HACK: for testing table 930, limit locked ranges to 30
hackLimitLock int

⚠️ Potential issue | 🟠 Major

Remove debug/hack code before merging.

The hackLimitLock field and associated logic (lines 159-164, 491-496, 169-192) are clearly WIP/debug code:

  • Hardcoded magic table ID 930
  • Comment says "limit locked ranges to 30" but code sets it to 2
  • Starts a goroutine that logs every 30 seconds

This should either be removed or converted to a proper testing/debugging feature with configuration flags rather than hardcoded values.


Comment on lines +491 to +496
// HACK: for testing table 930, limit locked ranges to hackLimitLock
if l.hackLimitLock > 0 && l.lockedRanges.Len() >= l.hackLimitLock {
ch := make(chan interface{}, 1)
// Return wait status so it keeps retrying but never succeeds
return LockRangeResult{Status: LockRangeStatusWait}, []<-chan interface{}{ch}
}

⚠️ Potential issue | 🔴 Critical

Goroutine leak: channel is never signaled.

This hack creates a channel that is never signaled or closed. When callers invoke WaitFn() (see lines 205-221 in LockRange), they will block forever waiting on this channel unless the context is cancelled.

ch := make(chan interface{}, 1)
// ch is never written to or closed
return LockRangeResult{Status: LockRangeStatusWait}, []<-chan interface{}{ch}

This causes goroutine leaks for any lock attempt when hackLimitLock is reached. Even for debugging purposes, this should at least close the channel or signal it periodically.


Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
@ti-chi-bot ti-chi-bot bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label (100-499 lines) on Jan 28, 2026.
Signed-off-by: dongmen <414110582@qq.com>
@ti-chi-bot
Copy link

ti-chi-bot bot commented Jan 28, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
logservice/logpuller/regionlock/range_ts_map.go (1)

27-34: Preserve hackMode in clones

rangeTsMap.clone doesn’t copy the new hackMode flag, so a clone created after hackSplitInto will unexpectedly allow set/unset to mutate entries. Consider copying the flag to avoid surprising behavior in hack mode.

🧩 Proposed fix
func (m *rangeTsMap) clone() (res *rangeTsMap) {
	res = &rangeTsMap{
		m:     btree.NewG(16, rangeTsEntryLess),
		start: m.start,
		end:   m.end,
+		hackMode: m.hackMode,
	}
	m.m.Ascend(func(i rangeTsEntry) bool {
		res.m.ReplaceOrInsert(i)
		return true
	})
	return
}
🤖 Fix all issues with AI agents
In `@logservice/logpuller/regionlock/region_range_lock.go`:
- Around line 128-130: The hackSplitInto(...) call mutates the unlockedRanges
map but does not update the unlockedRangesMinTs cache; add an explicit cache
update after every hackSplitInto invocation to recompute/unify
unlockedRangesMinTs (same approach used in set() and unset()) so the cached
minimum resolved ts stays consistent with unlockedRanges; apply this fix in the
three places where hackSplitInto is called (the block around the first call, and
the other two blocks referenced) ensuring you hold mu while updating
unlockedRangesMinTs.
🧹 Nitpick comments (1)
logservice/logpuller/regionlock/range_ts_map.go (1)

234-250: Guard hackSplitInto against non-positive n

If n <= 0, the map is cleared and left empty, which can make later reads return MaxUint64. A small guard prevents accidental misuse.

🛡️ Suggested guard
func (m *rangeTsMap) hackSplitInto(n int, ts uint64) {
+	if n <= 0 {
+		log.Panic("hackSplitInto requires n > 0")
+	}
	// Clear existing entries
	m.m.Clear(false)

Comment on lines +128 to +130
// unlockedRangesMinTs caches the minimum resolvedTs among unlocked ranges.
// It is protected by `mu`.
unlockedRangesMinTs uint64

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (repository: pingcap/ticdc; output lengths omitted):

#!/bin/bash
# Find all occurrences of unlockedRanges with context to understand mutation patterns
rg -n --type=go -B2 -A2 'unlockedRanges' logservice/logpuller/regionlock/region_range_lock.go
# Check whether unlockedRanges is used in other files in the package or elsewhere
fd -e go . logservice/logpuller/regionlock/ | xargs rg -l 'unlockedRanges' 2>/dev/null
# Get the file size to understand the structure
wc -l logservice/logpuller/regionlock/region_range_lock.go
# Find the hackSplitInto implementation to see whether it changes min values
rg -n --type=go -B3 -A10 'func.*hackSplitInto'
# Check for assignments to unlockedRanges outside this file
rg -n --type=go 'unlockedRanges\s*=' logservice/logpuller/regionlock/
# Search more broadly for any references to unlockedRanges
rg -n --type=go '\bunlockedRanges\b' logservice/logpuller/regionlock/region_range_lock.go | head -30
# Get the full hackSplitInto implementation
rg -n --type=go -A30 'func \(m \*rangeTsMap\) hackSplitInto'
# Check getMinTs usage in range_ts_map.go
rg -n --type=go 'getMinTs|getMin' logservice/logpuller/regionlock/range_ts_map.go | head -20
# Verify the hackSplitInto call sites
rg -n --type=go 'hackSplitInto' logservice/logpuller/regionlock/
# Double-check the getMinTs implementation
rg -n --type=go -A15 'func \(m \*rangeTsMap\) getMinTs\(\)'


Add cache update after hackSplitInto mutation

The hackSplitInto() call at line 162 mutates unlockedRanges but does not update unlockedRangesMinTs. While the cache value happens to be correct (all entries created have ts=startTs), this violates the synchronization principle: every path that mutates unlockedRanges should update the cache. Add an explicit cache update after hackSplitInto() to maintain consistency with the pattern used in set() and unset().

Also applies to: 289-293, 524-528


@asddongmen asddongmen closed this Feb 27, 2026