PHOENIX-7845 ReplicationLogGroup initialization resilience to standby cluster unavailability#2466

Open
tkhurana wants to merge 2 commits into apache:PHOENIX-7562-feature-new from tkhurana:PHOENIX-7562-feature-new

Conversation

Contributor

@tkhurana tkhurana commented May 7, 2026

Summary

  • Defer peer shard manager creation from ReplicationLogGroup.init() to individual mode consumers (SyncModeImpl, SyncAndForwardModeImpl, ReplicationLogDiscoveryForwarder)
  • When the standby namenode is unavailable at startup, the group gracefully degrades from SYNC to STORE_AND_FORWARD via the existing updateModeOnFailure() path
  • Change ReplicationLogGroup.get() to throw IOException instead of RuntimeException, so callers in IndexRegionObserver get proper error classification (not misclassified as IndexBuildingFailureException)
  • Forwarder lazily creates its peer shard manager with automatic retry on subsequent rounds
  • Replace createStandbyLog()/createFallbackLog() with single createReplicationLog(shardManager) factory
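The deferred-creation and degradation flow above can be sketched as follows. This is a minimal illustration under assumed names (`LogGroupSketch`, `ShardManager`, `Mode`, `getOrCreatePeerShardManager`) rather than the actual Phoenix classes; only `updateModeOnFailure()` and the SYNC/STORE_AND_FORWARD modes come from the PR description.

```java
import java.util.function.Supplier;

// Hypothetical sketch: peer shard manager creation is deferred out of
// init() into a lazy accessor; a creation failure (standby NN down)
// degrades the group from SYNC to STORE_AND_FORWARD instead of
// failing initialization.
class LogGroupSketch {
    enum Mode { SYNC, STORE_AND_FORWARD }

    interface ShardManager { }

    private final Supplier<ShardManager> peerFactory;
    private ShardManager peerShardManager;
    private volatile Mode mode = Mode.SYNC;

    LogGroupSketch(Supplier<ShardManager> peerFactory) {
        // init() no longer touches the standby cluster
        this.peerFactory = peerFactory;
    }

    // Single lazy synchronized accessor: the first mode consumer that
    // needs the peer shard manager triggers creation; later callers
    // retry automatically because the field stays null on failure.
    synchronized ShardManager getOrCreatePeerShardManager() {
        if (peerShardManager == null) {
            try {
                peerShardManager = peerFactory.get();
            } catch (RuntimeException e) {
                // Standby namenode unreachable: degrade, don't fail
                updateModeOnFailure();
                return null;
            }
        }
        return peerShardManager;
    }

    private void updateModeOnFailure() {
        mode = Mode.STORE_AND_FORWARD;
    }

    Mode getMode() {
        return mode;
    }
}
```

Keeping the field null after a failed attempt is what gives the forwarder its "retry on the next round" behavior for free.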

Test plan

  • testInitDegradesToSafWhenPeerUnavailable — peer unavailable at startup → SAF mode, writes succeed
  • testInitFailsWhenLocalUnavailable — local FS unavailable → init fails with IOException
  • testForwarderRetriesPeerCreation — forwarder retries peer shard manager on next round after initial failure
  • All 41 existing ReplicationLogGroupTest tests pass
  • All 3 ReplicationLogDiscoveryForwarderTest tests pass
  • Both ReplicationLogTest tests pass
  • HABaseIT integration tests (running)

@tkhurana tkhurana requested review from apurtell and kadirozde May 7, 2026 22:16
…eption

- Replace RuntimeException wrapping in get() with UncheckedIOException to
  avoid misclassifying unrelated RuntimeExceptions with IOException causes
- Bound peer shard manager creation with configurable timeout (default 10s)
  via CompletableFuture.get() to prevent blocking the disruptor handler
  thread on peer NN outage; TimeoutException triggers SAF degradation
- Consolidate peer shard manager into a single lazy synchronized accessor
  on ReplicationLogGroup; remove per-component caching from forwarder
- Cancel the in-flight future on timeout to release the ForkJoinPool thread
- Add test for timeout-triggered SAF degradation
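The timeout-bounding step can be sketched as below. The 10s default and the `CompletableFuture.get()`/cancel approach come from the commit message; the helper name and surrounding class are illustrative. Note that `CompletableFuture.cancel()` does not interrupt an already-running task, so the comment hedges on what cancellation actually releases.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: bound peer shard manager creation with a timeout
// so the disruptor handler thread never blocks indefinitely on a peer
// namenode outage. Returning null signals the caller to degrade to SAF.
class PeerInitSketch {
    static final long DEFAULT_TIMEOUT_MS = 10_000; // configurable in the PR

    static <T> T createWithTimeout(Supplier<T> factory, long timeoutMs) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(factory);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Mark the in-flight future cancelled so nothing waits on it;
            // the stuck ForkJoinPool task still has to unwind on its own.
            future.cancel(true);
            return null; // caller triggers SAF degradation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            // Creation itself failed (e.g. connection refused)
            return null;
        }
    }
}
```

Running the factory on a separate pool thread is what lets the handler thread impose a deadline on an otherwise uninterruptible blocking call.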