[fix](streaming-job) streaming CDC reliability fixes (scheduling latch, fetch-meta reason, PG db name, flaky case)#64310
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
…ingering reader before ALTER
… task success cannot clear it
|
/review |
ae09bf5 to
a019edb
Compare
There was a problem hiding this comment.
Pull request overview
This PR applies several reliability fixes to streaming CDC jobs, focusing on preventing scheduler stalls, preserving pause failure reasons, validating PostgreSQL config edge cases earlier, and stabilizing a flaky PG CDC regression test.
Changes:
- Remove the “0-row/0-byte commit => no more data” latch in
JdbcSourceOffsetProviderto avoid freezing streaming scheduling. - Reorder
fetchMeta()pause/failure handling inStreamingInsertJoband add a new FE config + validator to fail fast on over-long PostgreSQL database names. - Stabilize
test_streaming_postgres_job_special_offsetby draining any lingering CDC reader behavior after PAUSE before inserting “before mark” rows.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_postgres_job_special_offset.groovy | Adds deterministic “drain lingering reader” logic to reduce PAUSE/ALTER-offset flakiness. |
| fe/fe-core/src/main/java/org/apache/doris/job/offset/jdbc/JdbcSourceOffsetProvider.java | Removes an incorrect onTaskCommitted() override that could prematurely stop scheduling. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java | Reorders pause vs. failure-reason setting in fetchMeta() to address a race around PAUSED reason visibility. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/PostgresResourceValidator.java | Adds fail-fast validation for PostgreSQL database name length to avoid slot/publication lookup inconsistencies. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Introduces streaming_pg_max_identifier_length config (default 63). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Requested changes.
Critical checkpoint conclusions:
- Goal/test: the PR targets streaming CDC scheduling, fetch-meta failure preservation, PostgreSQL database-name validation, and a flaky special-offset regression. The scheduling/fetch-meta/test changes line up with the goal, but the PG database-name guard is still incomplete for byte-length truncation.
- Scope: the code changes are small and focused.
- Concurrency/lifecycle: the fetchMeta ordering is consistent with the job write lock and task-success callback flow; I found no new static lifecycle issue.
- Config: the new config is read dynamically at validation time, but the validation must use PostgreSQL byte-length semantics.
- Compatibility/parallel paths: no protocol or storage-format compatibility issue found; CREATE JOB and cdc_stream TVF both route through PostgresResourceValidator, and ALTER keeps source identity immutable.
- Tests: the regression drain change does not require an .out update, but the new PG database-name validation needs coverage for the byte-length edge.
- Observability/performance/transactions: no additional issue found.
User focus: no additional user-provided review focus was present.
a019edb to
6a29706
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
Requesting changes for one current regression in the added FE unit test.
Critical checkpoint conclusions:
- Goal and tests: the PR targets the stated streaming CDC reliability issues and adds focused coverage, but the new FE unit test currently uses an API unavailable to fe-core's Java 8 release compilation, so that coverage can fail at compile time.
- Scope: the GitHub PR diff is focused on the six streaming CDC files listed in the PR metadata.
- Concurrency and lifecycle: reviewed the scheduler, fetch-meta, task-success, and offset-provider paths. I did not add a duplicate for the already-known fetch-meta failure-reason thread; no additional distinct concurrency issue was found in this pass.
- Configuration: the new mutable master-only PG identifier-length config is only used during CREATE JOB validation; I found no runtime propagation requirement beyond that create-time check.
- Compatibility and persistence: no FE-BE protocol, storage-format, or transaction-format compatibility issue was introduced by the current diff.
- Parallel paths: the JDBC CDC zero-row scheduling path now relies on fetchRemoteMeta/hasMoreDataToConsume, which matches the stated fix; no missing parallel provider path was found.
- Test outputs: the regression test keeps ordered qt output stable after cleaning the probe row, but FE UT is currently blocked by the Java API issue below. GitHub also reports FE UT failing on this PR; I did not run local tests because this runner review was constrained to avoid filesystem access outside the checkout.
- Observability, performance, and security: no additional issue found.
User focus: review_focus.txt had no additional user-provided focus points.
| // The byte-based check must reject it before connecting (validate fails on the very first line). | ||
| @Test | ||
| public void testRejectMultibyteOverLongDatabaseName() { | ||
| String dbName = "库".repeat(22); |
There was a problem hiding this comment.
This test is compiled in fe-core, whose pom.xml configures maven-compiler-plugin with <release>8</release>. String.repeat was added in Java 11, so the FE UT compile can fail before this test runs. Please use a Java-8-compatible helper here, for example the existing commons-lang StringUtils.repeat(...) pattern used elsewhere in FE tests.
Proposed changes
Several independent reliability fixes for streaming CDC jobs.
Drop the 0-row latch in
JdbcSourceOffsetProvider.onTaskCommitted. SettinghasMoreData = falseon a 0-row/0-byte commit was wrong: a 0-row task does not mean the source is caught up (e.g. a large upstream transaction is still buffered on the cdc_client side). The latch made the scheduler stop dispatching tasks and could freeze streaming. The interface default is a no-op, so the override is simply removed;hasMoreDatais still advanced byfetchRemoteMeta()andhasMoreDataToConsume().Set the fetch-meta failure reason after pausing.
fetchMeta()set the failure reason and then paused. A concurrent task-success callback (which clears the reason under the job write lock) could wipe the freshly-set reason, leaving the jobPAUSEDwith an empty reason and auto-resume disabled. Pausing first orders the reason update after any in-flight success callback, so the reason survives.Reject an over-long PostgreSQL database name at CREATE JOB. PostgreSQL truncates a database name longer than its identifier limit, but the replication-slot lookup compares the full configured name, so an existing slot looks missing and the job then fails creating a duplicate slot. The connection itself still succeeds (PG truncates at connect time), so no connectivity check catches it. A new config
streaming_pg_max_identifier_length(default 63, adjustable for a larger NAMEDATALEN build) gates a fail-fast check.Stabilize
test_streaming_postgres_job_special_offset. PAUSE cancels the FE task but the in-flight cdc_client reader keeps polling (up tomax_interval) and could stream-load rows inserted right after the pause, defeating the ALTER-offset reposition. The test now deterministically drains the lingering reader before inserting the before-mark rows.