Skip to content

[fix](streaming-job) streaming CDC reliability fixes (scheduling latch, fetch-meta reason, PG db name, flaky case)#64310

Open
JNSimba wants to merge 4 commits into
apache:masterfrom
JNSimba:fix_streaming_ontaskcommitted_latch
Open

[fix](streaming-job) streaming CDC reliability fixes (scheduling latch, fetch-meta reason, PG db name, flaky case)#64310
JNSimba wants to merge 4 commits into
apache:masterfrom
JNSimba:fix_streaming_ontaskcommitted_latch

Conversation

@JNSimba

@JNSimba JNSimba commented Jun 9, 2026

Copy link
Copy Markdown
Member

Proposed changes

Several independent reliability fixes for streaming CDC jobs.

  1. Drop the 0-row latch in JdbcSourceOffsetProvider.onTaskCommitted. Setting hasMoreData = false on a 0-row/0-byte commit was wrong: a 0-row task does not mean the source is caught up (e.g. a large upstream transaction is still buffered on the cdc_client side). The latch made the scheduler stop dispatching tasks and could freeze streaming. The interface default is a no-op, so the override is simply removed; hasMoreData is still advanced by fetchRemoteMeta() and hasMoreDataToConsume().

  2. Set the fetch-meta failure reason after pausing. fetchMeta() set the failure reason and then paused. A concurrent task-success callback (which clears the reason under the job write lock) could wipe the freshly-set reason, leaving the job PAUSED with an empty reason and auto-resume disabled. Pausing first orders the reason update after any in-flight success callback, so the reason survives.

  3. Reject an over-long PostgreSQL database name at CREATE JOB. PostgreSQL truncates a database name longer than its identifier limit, but the replication-slot lookup compares the full configured name, so an existing slot looks missing and the job then fails creating a duplicate slot. The connection itself still succeeds (PG truncates at connect time), so no connectivity check catches it. A new config streaming_pg_max_identifier_length (default 63, adjustable for a larger NAMEDATALEN build) gates a fail-fast check.

  4. Stabilize test_streaming_postgres_job_special_offset. PAUSE cancels the FE task but the in-flight cdc_client reader keeps polling (up to max_interval) and could stream-load rows inserted right after the pause, defeating the ALTER-offset reposition. The test now deterministically drains the lingering reader before inserting the before-mark rows.

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba JNSimba changed the title [fix](streaming-job) drop the 0-row latch in JdbcSourceOffsetProvider.onTaskCommitted [fix](streaming-job) streaming CDC reliability fixes (scheduling latch, fetch-meta reason, PG db name, flaky case) Jun 10, 2026
@JNSimba JNSimba requested a review from Copilot June 10, 2026 02:33
@JNSimba

JNSimba commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

/review

@JNSimba JNSimba force-pushed the fix_streaming_ontaskcommitted_latch branch from ae09bf5 to a019edb Compare June 10, 2026 02:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies several reliability fixes to streaming CDC jobs, focusing on preventing scheduler stalls, preserving pause failure reasons, validating PostgreSQL config edge cases earlier, and stabilizing a flaky PG CDC regression test.

Changes:

  • Remove the “0-row/0-byte commit => no more data” latch in JdbcSourceOffsetProvider to avoid freezing streaming scheduling.
  • Reorder fetchMeta() pause/failure handling in StreamingInsertJob and add a new FE config + validator to fail fast on over-long PostgreSQL database names.
  • Stabilize test_streaming_postgres_job_special_offset by draining any lingering CDC reader behavior after PAUSE before inserting “before mark” rows.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_special_offset.groovy Adds deterministic “drain lingering reader” logic to reduce PAUSE/ALTER-offset flakiness.
fe/fe-core/src/main/java/org/apache/doris/job/offset/jdbc/JdbcSourceOffsetProvider.java Removes an incorrect onTaskCommitted() override that could prematurely stop scheduling.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java Reorders pause vs. failure-reason setting in fetchMeta() to address a race around PAUSED reason visibility.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/PostgresResourceValidator.java Adds fail-fast validation for PostgreSQL database name length to avoid slot/publication lookup inconsistencies.
fe/fe-common/src/main/java/org/apache/doris/common/Config.java Introduces streaming_pg_max_identifier_length config (default 63).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested changes.

Critical checkpoint conclusions:

  • Goal/test: the PR targets streaming CDC scheduling, fetch-meta failure preservation, PostgreSQL database-name validation, and a flaky special-offset regression. The scheduling/fetch-meta/test changes line up with the goal, but the PG database-name guard is still incomplete for byte-length truncation.
  • Scope: the code changes are small and focused.
  • Concurrency/lifecycle: the fetchMeta ordering is consistent with the job write lock and task-success callback flow; I found no new static lifecycle issue.
  • Config: the new config is read dynamically at validation time, but the validation must use PostgreSQL byte-length semantics.
  • Compatibility/parallel paths: no protocol or storage-format compatibility issue found; CREATE JOB and cdc_stream TVF both route through PostgresResourceValidator, and ALTER keeps source identity immutable.
  • Tests: the regression drain change does not require an .out update, but the new PG database-name validation needs coverage for the byte-length edge.
  • Observability/performance/transactions: no additional issue found.

User focus: no additional user-provided review focus was present.

@JNSimba JNSimba force-pushed the fix_streaming_ontaskcommitted_latch branch from a019edb to 6a29706 Compare June 10, 2026 09:24
@JNSimba

JNSimba commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

/review

@JNSimba

JNSimba commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

run buildall

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for one current regression in the added FE unit test.

Critical checkpoint conclusions:

  • Goal and tests: the PR targets the stated streaming CDC reliability issues and adds focused coverage, but the new FE unit test currently uses an API unavailable to fe-core's Java 8 release compilation, so that coverage can fail at compile time.
  • Scope: the GitHub PR diff is focused on the six streaming CDC files listed in the PR metadata.
  • Concurrency and lifecycle: reviewed the scheduler, fetch-meta, task-success, and offset-provider paths. I did not add a duplicate for the already-known fetch-meta failure-reason thread; no additional distinct concurrency issue was found in this pass.
  • Configuration: the new mutable master-only PG identifier-length config is only used during CREATE JOB validation; I found no runtime propagation requirement beyond that create-time check.
  • Compatibility and persistence: no FE-BE protocol, storage-format, or transaction-format compatibility issue was introduced by the current diff.
  • Parallel paths: the JDBC CDC zero-row scheduling path now relies on fetchRemoteMeta/hasMoreDataToConsume, which matches the stated fix; no missing parallel provider path was found.
  • Test outputs: the regression test keeps ordered qt output stable after cleaning the probe row, but FE UT is currently blocked by the Java API issue below. GitHub also reports FE UT failing on this PR; I did not run local tests because this runner review was constrained to avoid filesystem access outside the checkout.
  • Observability, performance, and security: no additional issue found.

User focus: review_focus.txt had no additional user-provided focus points.

// The byte-based check must reject it before connecting (validate fails on the very first line).
@Test
public void testRejectMultibyteOverLongDatabaseName() {
String dbName = "库".repeat(22);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is compiled in fe-core, whose pom.xml configures maven-compiler-plugin with <release>8</release>. String.repeat was added in Java 11, so the FE UT compile can fail before this test runs. Please use a Java-8-compatible helper here, for example the existing commons-lang StringUtils.repeat(...) pattern used elsewhere in FE tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants