fix(cosmos-spark): validate bounded change feed EOF against planned endLsn (#49380) by xinlian12 · Pull Request #49390 · Azure/azure-sdk-for-java

xinlian12 · 2026-06-05T16:32:01Z

Summary

Closes a silent data-loss path in the Cosmos DB Spark change feed connector. For bounded change feed reads, TransientIOErrorsRetryingIterator now validates that the latest continuation LSN has reached the planned endLsn before reporting end-of-stream. If it has not, the iterator throws a retryable OperationCancelledException so that the existing transient-IO retry path either recovers or fails the Spark task, instead of silently committing the planned end offset.

Issue

Fixes #49380

Root Cause

sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/TransientIOErrorsRetryingIterator.scala::hasNextInternalCore

When the underlying feedResponseIterator returns false from hasNext (publisher terminated), the iterator returned Some(false) unconditionally. validateNextLsn guards the upper edge for records (no _lsn > endLsn is returned), but no symmetric check exists for the lower edge of EOF: if the publisher terminates with the latest continuation LSN still below endLsn, Spark sees a clean end-of-stream and commits the planned end offset.

Triggers for the underlying premature publisher termination can include slow replicas / slow regions in the Cosmos service, paging, Reactor, networking, or cancellation interactions. The fix is independent of the trigger: the connector must defend against EOF below the planned endLsn.

Fix

Added a validateEofProgressOrThrow() helper invoked inside hasNextInternalCore immediately before returning Some(false) on underlying EOF:

Unbounded path (endLsn = None): returns immediately. Behavior is unchanged.
Page(s) consumed: parses the latest continuation token with SparkBridgeImplementationInternal.extractContinuationTokensFromChangeFeedStateJson and takes the minimum LSN across all ranges. EOF is allowed only when minLatestLsn >= endLsn. The min-across-ranges semantics matches the issue's validation table for multi-range continuations.
No page consumed: EOF is allowed only when startLsn == endLsn (legitimate empty bounded interval). startLsn is plumbed in via a new additive constructor parameter on the iterator and derived consistently in ChangeFeedPartitionReader.
Failure case: throws OperationCancelledException, which extends CosmosException with timeout status. Exceptions.canBeTransientFailure treats it as transient, so the iterator's existing executeWithRetry loop retries up to maxRetryCount and then propagates so the Spark task fails — preventing the silent checkpoint advance.
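
The decision rules listed above can be sketched as a small pure function. This is a simplified model rather than the connector's code: `EofBelowEndLsnException`, `BoundedEofCheck`, and the `Map[String, Long]` of per-range LSNs are hypothetical stand-ins for OperationCancelledException and the parsed continuation token, and treating an empty range map as an under-run is an assumption.

```scala
// Simplified, hypothetical model of the EOF decision; names and types are illustrative only.
final class EofBelowEndLsnException(message: String) extends RuntimeException(message)

object BoundedEofCheck {

  /** Returns normally when EOF is legitimate; throws when the read stopped short of endLsn.
    *
    * @param startLsn         planned start LSN of the bounded read
    * @param endLsn           planned end LSN; None means the read is unbounded
    * @param latestLsnByRange latest continuation LSN per range; None if no page was consumed
    */
  def validateEofProgressOrThrow(
      startLsn: Option[Long],
      endLsn: Option[Long],
      latestLsnByRange: Option[Map[String, Long]]): Unit =
    endLsn match {
      case None => () // unbounded read: nothing to validate
      case Some(end) =>
        latestLsnByRange match {
          case None =>
            // No page consumed: only an empty bounded interval may end immediately.
            if (!startLsn.contains(end)) {
              val start = startLsn.map(_.toString).getOrElse("n/a")
              throw new EofBelowEndLsnException(
                s"EOF before any page was read: startLsn=$start, endLsn=$end")
            }
          case Some(lsns) if lsns.isEmpty =>
            // Assumption: a continuation without ranges counts as no progress.
            throw new EofBelowEndLsnException(s"EOF with no ranges in continuation, endLsn=$end")
          case Some(lsns) =>
            // The slowest range decides: every range must have reached endLsn.
            val minLatestLsn = lsns.values.min
            if (minLatestLsn < end)
              throw new EofBelowEndLsnException(
                s"EOF below planned end: minLatestLsn=$minLatestLsn, endLsn=$end")
        }
    }
}
```

Taking the minimum rather than the maximum is what catches a multi-range continuation in which one range has passed endLsn while another still lags behind it.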

Alternatives considered (and rejected): a boolean-returning validator (would still need to convert to retryable exception); a Reactor-side sentinel wrapper (more invasive, risks altering prefetch/subscription); using only the current-continuation helper (misses multi-range lag).

How the Fix Works

underlying feedResponseIterator.hasNext == false
                │
                ▼
   validateEofProgressOrThrow()
                │
   endLsn isEmpty? ─── yes ──► return Some(false)   (unbounded, unchanged)
                │
                no
                │
                ▼
   lastContinuationToken null?
                │
                ├── yes ──► startLsn == endLsn?
                │                  │
                │                  ├── yes ──► complete
                │                  └── no ───► throw OperationCancelledException
                │
                └── no ───► parse min LSN across ranges
                                   │
                                   ▼
                            minLatestLsn >= endLsn?
                                   │
                                   ├── yes ──► complete
                                   └── no ───► throw OperationCancelledException

   on throw (either path)
                │
                ▼
   executeWithRetry retries
                │
                ▼
   exhausted ─► Spark task fails
                   (no silent commit)
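
The last stage of the flow, retry until exhausted and then propagate, can be illustrated with a minimal sketch. It assumes nothing about the real executeWithRetry beyond what the description states (retry up to maxRetryCount on a transient failure, then let the exception escape); `RetrySketch` and `TransientFailure` are hypothetical names.

```scala
import scala.annotation.tailrec

object RetrySketch {
  // Hypothetical stand-in for an exception the connector classifies as transient.
  final class TransientFailure(message: String) extends RuntimeException(message)

  /** Runs `op`, retrying on TransientFailure at most `maxRetryCount` times.
    * Once retries are exhausted the last failure propagates to the caller,
    * which in the Spark connector means the task fails instead of committing.
    */
  def executeWithRetry[T](maxRetryCount: Int)(op: () => T): T = {
    @tailrec
    def loop(attempt: Int): T = {
      val result: Either[TransientFailure, T] =
        try Right(op())
        catch {
          // Only swallow the failure while retries remain; otherwise it escapes.
          case e: TransientFailure if attempt < maxRetryCount => Left(e)
        }
      result match {
        case Right(value) => value
        case Left(_)      => loop(attempt + 1)
      }
    }
    loop(0)
  }
}
```

With maxRetryCount = 2 the operation is attempted at most three times: the initial call plus two retries.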

Testing

Existing tests in TransientIOErrorsRetryingIteratorSpec continue to pass (unbounded path is unaffected).
Canonical repro test added: bounded read with endLsn=20, final continuation at LSN 15 ⇒ OperationCancelledException thrown (previously returned false silently).
Additional bounded-EOF validation cases (rows 2–6 from the issue's validation matrix) are added by the Test Planner in a follow-up commit in this same PR.
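
The canonical repro can be approximated outside the connector with a toy iterator. The sketch below is not the test in TransientIOErrorsRetryingIteratorSpec; `Page`, `BoundedPageIterator`, and `EofUnderRunException` are hypothetical, and it models a single range only.

```scala
// Hypothetical single-range model of a bounded change feed read.
final case class Page(records: Seq[String], continuationLsn: Long)

final class EofUnderRunException(message: String) extends RuntimeException(message)

/** Wraps a page source and refuses to report EOF until the planned endLsn is reached. */
final class BoundedPageIterator(pages: Iterator[Page], startLsn: Long, endLsn: Long)
    extends Iterator[Page] {

  // LSN carried by the continuation of the most recently consumed page, if any.
  private var latestLsn: Option[Long] = None

  override def hasNext: Boolean =
    if (pages.hasNext) true
    else {
      val reachedEnd = latestLsn match {
        case Some(lsn) => lsn >= endLsn
        case None      => startLsn == endLsn // empty bounded interval
      }
      if (!reachedEnd) {
        val observed = latestLsn.map(_.toString).getOrElse("none")
        throw new EofUnderRunException(s"EOF at LSN $observed, planned endLsn=$endLsn")
      }
      false
    }

  override def next(): Page = {
    val page = pages.next()
    latestLsn = Some(page.continuationLsn)
    page
  }
}
```

Draining an iterator built from pages whose continuations end at LSN 15 with endLsn = 20 throws on the final hasNext call, whereas the same pages with a final continuation at LSN 20 drain cleanly.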

Risk Assessment

Breaking change: No. New constructor parameter has a default of None; default behavior for unbounded reads is unchanged.
Behavior change: Additive — bounded reads gain an EOF-progress validation step.
Performance impact: None on the hot path (validation runs only at EOF, once per iterator lifecycle).
Customer-visible consequence: Some workloads that today silently skip records on bounded change feed EOF will now visibly fail the Spark task after retries are exhausted (desired: at-least-once delivery preserved over silent data loss).

Cross-SDK Impact

Populated by the orchestrator after Phase 7.

…ndLsn

When TransientIOErrorsRetryingIterator reaches EOF on a bounded change feed read, validate that the latest continuation LSN (min across ranges) has reached the planned endLsn. If not, throw OperationCancelledException so the existing transient-IO retry path can recover or fail the task, preventing silent commits of unread bounded intervals.

Fixes Azure#49380

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Covers issue Azure#49380 validation table rows 2-6, plus retry-path and row-level upper-bound regression checks for TransientIOErrorsRetryingIterator. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…er consolidation + diagnostics

- F1: catch NonFatal in extractLatestMinLsnFromContinuation, wrap as retryable OperationCancelledException
- F2: drop startLsn default; require callers to pass it explicitly
- F3: consolidate min-LSN-across-ranges parsing into SparkBridgeImplementationInternal
- F4: logWarning on first under-run with planned vs observed LSN
- F6: tighten getLatestContinuationToken to Option(lastContinuationToken.get())
- F7: logError before throw + clearer diagnostic context
- F9: render startLsn/endLsn without raw Option wrapper in exception text
- S2: verified continuation includes both child ranges after split (see code comment)
- S7: add cancellation-hint to exception message
- F8/S8 test additions: malformed continuation, multi-page progression, degenerate endLsn==startLsn with continuation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Removed redundant logWarning before logError on EOF under-run path (deep reviewer F11; previously logged the same message twice and could double-count on log-based alerting).
- Inlined SparkBridgeImplementationInternal.extractMinLatestLsnFromChangeFeedContinuationOrFallback at the call site and removed the one-line extractLatestMinLsnFromContinuation private wrapper (deep reviewer F12).
- Skeptic Lens NonFatal/InterruptedException concern verified factually incorrect: scala.util.control.NonFatal.apply explicitly excludes InterruptedException, so the wrap cannot convert cancellation into a retry loop.
- Caller-site audit verified: all 4 production + 2 test call sites pass startLsn explicitly. Shaded azure-cosmos-spark_4-1_2-13 uses generated target/shared-sources/ mirrors that sync automatically.

17/17 unit tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
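
The NonFatal point can be checked in isolation. The snippet below only exercises scala.util.control.NonFatal from the standard library; `NonFatalCheck`, `classify`, and its return labels are hypothetical and merely mirror the wrap-or-propagate decision described in F1.

```scala
import scala.util.control.NonFatal

object NonFatalCheck {
  /** Mirrors a `catch { case NonFatal(e) => wrap(e) }` handler:
    * anything NonFatal matches would be wrapped, everything else escapes untouched.
    */
  def classify(t: Throwable): String = t match {
    case NonFatal(_) => "wrapped-as-retryable"
    case _           => "propagated-unchanged"
  }
}
```

An InterruptedException falls into the second branch, so a cancelled thread is never turned into a retryable failure by such a handler, while an ordinary parsing error such as IllegalArgumentException is wrapped.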

xinlian12 · 2026-06-05T17:24:28Z

/azp run java - cosmos

azure-pipelines · 2026-06-05T17:24:54Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-06-05T18:38:53Z

Retrying CI: the two failing checks are unrelated to this PR (Spark connector / Scala only).

java - cosmos - ci (Test Emulator windows2022_EmulatorOnlyIntegrationTestsTcpJava8): Java 8 TCP emulator integration test; this PR does not touch the core SDK, and the Spark connector requires Java 11+. Recent merged cosmos PRs all passed this same stage, suggesting a one-off Windows emulator flake: "chore(benchmark): wire excludedRegionsList into Cosmos*RequestOptions" #49313; "Cosmos: Promote Full Fidelity Change Feed (AllVersionsAndDeletes) API to GA" #49283; "Increment versions for cosmos releases" #49034; "Cosmos release 2026-05-01" #49003; "[Cosmos] Make JsonSerializable.propertyBag final to fix sporadic NPE in DatabaseAccount lazy getters" #49258.
java - cosmos (Build Test macoslatest_18_NotFromSource_TestsOnly) — CANCELLED (Java 18; Spark module is not exercised on Java 18).

/azp run java - cosmos
/azp run java - cosmos - ci

github-actions Bot added the Cosmos label Jun 5, 2026

xinlian12 and others added 3 commits June 5, 2026 09:40

test(cosmos-spark): add bounded change feed EOF validation matrix

1c24fd6


xinlian12 closed this Jun 5, 2026

xinlian12 mentioned this pull request Jun 5, 2026

[Cosmos]ThrowIllegalStateExceptionIfChangeFeedFetchStoppedBeforeEndLSN #49393

Merged

xinlian12 deleted the fix/issue-49380-spark-bounded-changefeed-eof branch June 5, 2026 19:06

xinlian12 wants to merge 4 commits into Azure:main from xinlian12:fix/issue-49380-spark-bounded-changefeed-eof
