PHOENIX-7845 ReplicationLogGroup initialization resilience to standby cluster unavailability#2466

Open
tkhurana wants to merge 2 commits into apache:PHOENIX-7562-feature-new from tkhurana:PHOENIX-7562-feature-new

Conversation

Contributor

@tkhurana tkhurana commented May 7, 2026

Summary

  • Defer peer shard manager creation from ReplicationLogGroup.init() to individual mode consumers (SyncModeImpl, SyncAndForwardModeImpl, ReplicationLogDiscoveryForwarder)
  • When the standby namenode is unavailable at startup, the group gracefully degrades from SYNC to STORE_AND_FORWARD via the existing updateModeOnFailure() path
  • Change ReplicationLogGroup.get() to throw IOException instead of RuntimeException, so callers in IndexRegionObserver get proper error classification (not misclassified as IndexBuildingFailureException)
  • Forwarder lazily creates its peer shard manager with automatic retry on subsequent rounds
  • Replace createStandbyLog()/createFallbackLog() with single createReplicationLog(shardManager) factory
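The deferred-creation and degradation flow above can be sketched as follows. This is a minimal illustration under assumed names (`LogGroupSketch`, `ShardManager`, `Mode`, `getOrCreatePeerShardManager`) rather than the actual Phoenix classes; only `updateModeOnFailure()` and the SYNC/STORE_AND_FORWARD modes come from the PR description.

```java
import java.util.function.Supplier;

// Hypothetical sketch: peer shard manager creation is deferred out of
// init() into a lazy accessor; a creation failure (standby NN down)
// degrades the group from SYNC to STORE_AND_FORWARD instead of
// failing initialization.
class LogGroupSketch {
    enum Mode { SYNC, STORE_AND_FORWARD }

    interface ShardManager { }

    private final Supplier<ShardManager> peerFactory;
    private ShardManager peerShardManager;
    private volatile Mode mode = Mode.SYNC;

    LogGroupSketch(Supplier<ShardManager> peerFactory) {
        // init() no longer touches the standby cluster
        this.peerFactory = peerFactory;
    }

    // Single lazy synchronized accessor: the first mode consumer that
    // needs the peer shard manager triggers creation; later callers
    // retry automatically because the field stays null on failure.
    synchronized ShardManager getOrCreatePeerShardManager() {
        if (peerShardManager == null) {
            try {
                peerShardManager = peerFactory.get();
            } catch (RuntimeException e) {
                // Standby namenode unreachable: degrade, don't fail
                updateModeOnFailure();
                return null;
            }
        }
        return peerShardManager;
    }

    private void updateModeOnFailure() {
        mode = Mode.STORE_AND_FORWARD;
    }

    Mode getMode() {
        return mode;
    }
}
```

Keeping the field null after a failed attempt is what gives the forwarder its "retry on the next round" behavior for free.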

Test plan

  • testInitDegradesToSafWhenPeerUnavailable — peer unavailable at startup → SAF mode, writes succeed
  • testInitFailsWhenLocalUnavailable — local FS unavailable → init fails with IOException
  • testForwarderRetriesPeerCreation — forwarder retries peer shard manager on next round after initial failure
  • All 41 existing ReplicationLogGroupTest tests pass
  • All 3 ReplicationLogDiscoveryForwarderTest tests pass
  • Both ReplicationLogTest tests pass
  • HABaseIT integration tests (running)

@tkhurana tkhurana requested review from apurtell and kadirozde May 7, 2026 22:16
…eption

- Replace RuntimeException wrapping in get() with UncheckedIOException to
  avoid misclassifying unrelated RuntimeExceptions with IOException causes
- Bound peer shard manager creation with configurable timeout (default 10s)
  via CompletableFuture.get() to prevent blocking the disruptor handler
  thread on peer NN outage; TimeoutException triggers SAF degradation
- Consolidate peer shard manager into a single lazy synchronized accessor
  on ReplicationLogGroup; remove per-component caching from forwarder
- Cancel the in-flight future on timeout to release the ForkJoinPool thread
- Add test for timeout-triggered SAF degradation
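The timeout-bounding step can be sketched as below. The 10s default and the `CompletableFuture.get()`/cancel approach come from the commit message; the helper name and surrounding class are illustrative. Note that `CompletableFuture.cancel()` does not interrupt an already-running task, so the comment hedges on what cancellation actually releases.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: bound peer shard manager creation with a timeout
// so the disruptor handler thread never blocks indefinitely on a peer
// namenode outage. Returning null signals the caller to degrade to SAF.
class PeerInitSketch {
    static final long DEFAULT_TIMEOUT_MS = 10_000; // configurable in the PR

    static <T> T createWithTimeout(Supplier<T> factory, long timeoutMs) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(factory);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Mark the in-flight future cancelled so nothing waits on it;
            // the stuck ForkJoinPool task still has to unwind on its own.
            future.cancel(true);
            return null; // caller triggers SAF degradation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            // Creation itself failed (e.g. connection refused)
            return null;
        }
    }
}
```

Running the factory on a separate pool thread is what lets the handler thread impose a deadline on an otherwise uninterruptible blocking call.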