fix: SDK-4508 offload SessionService.onFocus / onUnfocused off the main thread by abdulraqeeb33 · Pull Request #2646 · OneSignal/OneSignal-Android-SDK

abdulraqeeb33 · 2026-05-12T14:08:15Z

Summary

ANR-dump analysis of logs/2026-05-12 (500 ANR entries, all on sdk_background_threading) attributes 25 of 500 ANRs (5.0%) to SessionService.onFocus blocking the main thread. Stack signature:

java.util.concurrent.LinkedBlockingQueue.offer
java.util.concurrent.ThreadPoolExecutor.execute
kotlinx.coroutines.scheduling.CoroutineScheduler.dispatch
com.onesignal.common.threading.OneSignalDispatchers.<lazy chain>
... operation-repo / IAM trigger eval handlers ...
com.onesignal.session.internal.session.impl.b$c.invoke   (sessionLifeCycleNotifier.fire)
com.onesignal.session.internal.session.impl.b.onFocus    (SessionService.onFocus)
com.onesignal.core.internal.application.impl.a.handleFocus  (fanout on main)

SessionService.onFocus runs synchronously on the main thread via ApplicationService.handleFocus -> applicationLifecycleNotifier.fire. Inside, sessionLifeCycleNotifier.fire { onSessionStarted / onSessionActive } invokes the registered session-lifecycle handlers (operation repo, IAM trigger eval, etc.) synchronously; the first one to touch OneSignalDispatchers pays the cold-init cost on the main thread (5–20s blocks before the watchdog fires).

Same SDK-4505 / SDK-4506 / SDK-4507 fanout pattern — but on a pre-existing handler that predates background threading.

Linear: SDK-4508

Fix

Wrap onFocus / onUnfocused bodies in runOnSerialIOIfBackgroundThreading (the same gated helper SDK-4505, SDK-4506, and SDK-4507 use). FF off retains the original inline semantics for the rollout control cohort and existing synchronous tests; FF on routes through the dedicated serial IO dispatcher so the fanout no longer runs on main.

Timing semantics preserved

The body of onFocus writes session.startTime = _time.currentTimeMillis and session.focusTime = _time.currentTimeMillis to record when Android delivered the lifecycle event. If we simply deferred everything to the serial dispatcher, those timestamps would shift to whenever the dispatcher got around to the block — which can be hundreds of ms later under load.

To keep the existing semantics:

onFocus captures _time.currentTimeMillis on the caller's thread BEFORE the wrapper, passes it into the deferred handleOnFocus(firedOnSubscribe, focusTimeMs) block.
onUnfocused does the same so activeDuration accounting stays accurate (dt = unfocusTimeMs - session.focusTime, both captured on the main thread).

Base branch

Stacked on ar/sdk-4506-notifications-onfocus-anr (#2644) because the fix relies on runOnSerialIOIfBackgroundThreading, which was introduced in that PR. Rebases cleanly onto main after #2643 and #2644 land.

Test plan

SessionServiceTests existing assertions about state mutation continue to pass (the FF defaults to off, so the wrapper runs inline; the new private handleOnFocus / handleOnUnfocused preserve behavior).
Adds three new regression tests:
- onFocus dispatches the session-mutation body through runOnSerialIOIfBackgroundThreading.
- onUnfocused dispatches the activeDuration update through runOnSerialIOIfBackgroundThreading.
- Rapid onUnfocused -> onFocus burst routes both events through the same gated helper in submission order (mirrors SDK-4505 BackgroundManager burst coverage; protects against activeDuration drift if the events ever raced across an unbounded IO pool).
:core detekt — no new findings.

Made with Cursor

…off the main thread ApplicationService.handleFocus dispatches onFocus/onUnfocused on the main thread (via Activity.onStart -> dispatchActivityStarted). BackgroundManager was previously calling JobScheduler.cancel/schedule synchronously from those callbacks. JobScheduler operations are synchronous Binder transactions that can stall for many seconds on some devices (notably Xiaomi/MIUI under power-save), producing main-thread ANRs. Reported example: 20,796 ms main-thread block on a Xiaomi 25078RA3EL / Android 15 with sample insertId ycae33cjpu6gcyut on SDK v5.9.0. Route cancelSyncTask() and scheduleBackground() through suspendifyOnIO so the actual JobScheduler call happens on OneSignalDispatchers.IO (or Dispatchers.IO when the BACKGROUND_THREADING flag is off). State mutations inside cancelSyncTask/scheduleSyncTask are already guarded by synchronized(lock); JobScheduler.cancel is idempotent and scheduleSyncServiceAsJob is de-duped by isJobIdRunning(), so running on a multi-threaded pool is safe. Co-authored-by: Cursor <cursoragent@cursor.com>

Adds the regression test promised in the PR TODO. Verifies that BackgroundManager.onFocus() and onUnfocused() route through OneSignalDispatchers.launchOnIO (i.e. via suspendifyOnIO) instead of running JobScheduler operations inline on the calling thread, which was the source of the 20s main-thread ANR reported in SDK-4505. Mirrors the existing pattern in ThreadUtilsFeatureFlagTests (mockkObject(OneSignalDispatchers) + verify launchOnIO is invoked exactly once per lifecycle callback) with BACKGROUND_THREADING toggled on, since that is the configuration the production ANR was reported under. Brings diff coverage on the touched executable lines from 0% to 100%, satisfying the PR diff-coverage gate. Co-authored-by: Cursor <cursoragent@cursor.com>

…patcher The initial SDK-4505 fix routed BackgroundManager.onFocus/onUnfocused through suspendifyOnIO, which dispatches onto the multi-threaded OneSignal IO pool (core 2 / max 3). That solved the main-thread ANR but left a subtler race for rapid lifecycle bursts: an `unfocused -> focused` pair submitted in close succession can be picked up by two IO worker threads and executed in the opposite order, leaving a SyncJob scheduled while the app is foregrounded (or wiping a freshly scheduled job when the user has just backgrounded). Each individual operation is internally safe (synchronized(lock), the nextScheduledSyncTimeMs dedupe, and the isJobIdRunning short-circuit) but last-writer-wins across the pool is not preserved, and the problem will worsen as these handlers grow per-event work (session timing, focus counters, analytics) that can't be reconciled from a single state snapshot. Introduces a dedicated single-thread dispatcher for order-sensitive work: - OneSignalDispatchers.SerialIO: a daemon Executors.newSingleThreadExecutor named "OneSignal-SerialIO", with a Dispatchers.IO.limitedParallelism(1) fallback if our executor fails to start. Submission order on the caller thread equals execution order on this thread, so callers no longer need their own synchronization to preserve sequencing. - launchOnSerialIO(block) on OneSignalDispatchers. - suspendifyOnSerialIO(block) helper in ThreadUtils, mirroring the existing suspendifyOnIO shape. Always routes through OneSignalDispatchers regardless of ThreadingMode.useBackgroundThreading, since the ordering guarantee is the whole point of this entry point and the single thread carries none of the resource concerns the BACKGROUND_THREADING FF gates. BackgroundManager.onFocus/onUnfocused now use suspendifyOnSerialIO. Refactor getStatus() into small executorStatus / scopeStatus helpers to extend the diagnostic output to the new dispatcher without tripping detekt's LongMethod / ComplexMethod / StringLiteralDuplication checks (the refactor incidentally brings the function well under those thresholds). IOMockHelper stubs suspendifyOnSerialIO and launchOnSerialIO alongside the existing entry points, so any spec using the helper transparently waits on serial work via awaitIO(). BackgroundManagerTests now verifies launchOnSerialIO is the dispatcher target, and adds a `rapid unfocus -> focus burst` case that uses verifyOrder to assert both events are submitted to the serial dispatcher in main-thread submission order. Co-authored-by: Cursor <cursoragent@cursor.com>

…ing FF Wraps the suspendifyOnSerialIO offload added in the prior two commits in `if (ThreadingMode.useBackgroundThreading) { ... } else { ... }` so the fix can ship as an A/B against the legacy inline path rather than as an unconditional change: FF on -> suspendifyOnSerialIO { cancelSyncTask() / scheduleBackground() } Routes through OneSignalDispatchers.SerialIO, off the main thread, with submission order preserved. FF off -> cancelSyncTask() / scheduleBackground() Legacy inline path; retains the SDK-4505 ANR for the control cohort so we can measure the fix's impact. The existing `sdk_background_threading` FF is APP_STARTUP-activated (see FeatureFlag.kt) so a given session is latched on one path and won't bounce mid-run, which keeps the comparison clean. The production ANR sample was on FF=ON before this PR (because the legacy code called JobScheduler.cancel directly, bypassing the threading helpers), so this gate is repurposing the flag to also control "does BackgroundManager offload its lifecycle calls" — once the FF=ON cohort confirms the ANR rate drop, we'll flip the FF globally or remove the gate. BackgroundManagerTests now covers both arms: - 3 existing cases run with useBackgroundThreading=true and assert OneSignalDispatchers.launchOnSerialIO is invoked (the offload + the rapid-burst ordering case). - 2 new cases run with useBackgroundThreading=false and assert launchOnSerialIO is NOT invoked, i.e. onFocus/onUnfocused execute on the calling thread. A small applicationServiceWithStubbedJobScheduler helper stubs IApplicationService.appContext.getSystemService so the inline path can complete without a Robolectric Context. Co-authored-by: Cursor <cursoragent@cursor.com>

…etStatus helpers Brings diff coverage above the 80% gate (50.0% -> 92.5%) by exercising the new SerialIO plumbing introduced earlier in the PR. OneSignalDispatchersTests: - SerialIO dispatcher executes work on a background thread. - launchOnSerialIO runs N queued tasks on a single thread in submission order (the SDK-4505 ordering contract). - An exception in one SerialIO task does not poison the queue for subsequent tasks. - getStatus() reports the new SerialIO Executor / Scope lines. - getPerformanceMetrics() reports the new SerialIO Queue line and folds serialCompleted into the total. - executorStatus / scopeStatus return "Active" / "Shutdown" / "Cancelled" on the happy paths and the "<name> Not initialized <message-or-Unknown error>" string when the underlying isShutdown / isActive lambda throws. ThreadUtilsFeatureFlagTests: - suspendifyOnSerialIO always routes through OneSignalDispatchers.launchOnSerialIO regardless of ThreadingMode.useBackgroundThreading. The serial ordering guarantee is the whole point of this entry point, so we want a regression test that pins down that contract. - suspendifyOnSerialIO swallows exceptions thrown inside the block (matches the production try/catch behavior). To make the defensive catch branches in executorStatus / scopeStatus reachable from unit tests, they are promoted from `private inline` to `internal`. The catch only fires when the underlying lazy executor / scope is in a failed state (re-throws on every isShutdown / isActive access), which is not triggerable from a unit test without forcing Executors.newSingleThreadExecutor to fail to construct. Exposing the helpers and passing throwing lambdas in the test is the cheapest way to cover that branch without simulating a JVM-level failure. The remaining 4 uncovered lines on OneSignalDispatchers.kt are the catch blocks of the lazy executor / dispatcher / getPerformanceMetrics initializers themselves, which are genuinely unreachable from a unit test (require Executors / asCoroutineDispatcher / the entire performance-metrics try{} body to throw) - those are kept as defensive programming. Co-authored-by: Cursor <cursoragent@cursor.com>

…ated lifecycle offload Introduces a small ThreadUtils helper that conditionally dispatches a non-suspending block to OneSignalDispatchers.SerialIO when the SDK_BACKGROUND_THREADING remote feature flag is on, and runs it inline on the caller thread when the flag is off. FF on -> suspendifyOnSerialIO { block() } // off-main, single-thread, ordered FF off -> block() // legacy inline path This is the same gating pattern SDK-4505 applied inline at the BackgroundManager call sites, lifted into a reusable helper so other order-sensitive lifecycle handlers (NotificationsManager.onFocus in the follow-up commit; SessionService and friends later) can opt into it with one wrapper call instead of repeating the if/else everywhere. Differs from the existing entry points by design: - suspendifyOnIO -> always offloads, multi-thread IO pool (no ordering guarantee). - suspendifyOnSerialIO -> always offloads, single-thread serial (ordering guarantee; FF-agnostic on purpose). - runOnSerialIOIfBackgroundThreading -> FF-gated. Used when we want an A/B comparison of the new behavior against the previous inline behavior in production. The block is non-suspending because the FF-off branch executes on the caller's thread - a suspending block there would force a runBlocking, which defeats the purpose of the comparison. IOMockHelper stubs the helper to run inline on the test thread by default so existing specs keep their observable behavior; tests that want to exercise the FF-on (offload) branch can override the stub. Regression tests on ThreadUtilsFeatureFlagTests cover both branches: - FF on -> launchOnSerialIO invoked, block not run inline - FF off -> launchOnSerialIO not invoked, block runs on caller thread Co-authored-by: Cursor <cursoragent@cursor.com>

…hind sdk_background_threading FF Production OTel sample insertId 9qy5s0ta0cwqwmb0 shows a 30.5 s main-thread ANR on vivo I2306 / Android 15: Application Not Responding: Main thread blocked for 30516ms at SQLiteConnection.nativeExecuteForLong ... (WorkDatabase_Impl / WorkManagerImpl.enqueueUniqueWork) at NotificationRestoreWorkManager.beginEnqueueingWork at NotificationsManager.refreshNotificationState at NotificationsManager.onFocus at ApplicationService.handleFocus at ApplicationService.onActivityStarted at Application.dispatchActivityStarted at Activity.onStart Same Activity-lifecycle fan-out as SDK-4505: Activity.onStart synchronously fires every IApplicationLifecycleHandler.onFocus on the main thread. NotificationsManager.onFocus invokes refreshNotificationState() which calls NotificationRestoreWorkManager.beginEnqueueingWork(), which: 1. Lazily constructs WorkManager (opens / migrates the SQLite store on first call). 2. Synchronously writes to the WorkManager DB via enqueueUniqueWork. Both are SQLite disk I/O on the main thread. Under storage contention, low memory, or OEM background daemons holding the fs lock, the call can stall for 10s of seconds and trip an ANR. This is the first foreground event after process start - the restored flag inside beginEnqueueingWork short-circuits on subsequent calls, so only the first onFocus eats the stall. (Subsequent rapid focus events still benefit from the helper because they go through it consistently; the stall just doesn't repeat per session.) Fix: route refreshNotificationState() through the new runOnSerialIOIfBackgroundThreading helper (added in the preceding commit). On the SDK_BACKGROUND_THREADING FF=on cohort, the work moves to OneSignalDispatchers.SerialIO - the same single-thread executor BackgroundManager already uses, so onFocus work across both handlers stays globally ordered (no risk of a stale NotificationsManager refresh interleaving with a fresh BackgroundManager schedule/cancel pair). On the FF=off cohort the inline behavior is retained, mirroring the SDK-4505 rollout strategy and giving us the A/B signal we need to confirm ANR rate drops on the FF=on cohort. Regression tests on NotificationsManagerTests assert: - onFocus dispatches refreshNotificationState through the gated helper (the contract that the call site stays off-main on FF=on). - A rapid onFocus burst routes both events through the gated helper in submission order (defense against future per-event work being added here without re-introducing the reorder hazard that the SerialIO dispatcher prevents). Co-authored-by: Cursor <cursoragent@cursor.com>

…ionController, FeatureFlagsRefreshService) behind sdk_background_threading FF Expand SDK-4506's scope to cover the other two lifecycle handlers that the ANR-dump analysis from logs/2026-05-12 surfaced as also stalling the main thread on cold start under `sdk_background_threading`: * NotificationPermissionController polling lifecycle listener — reads `_configModelStore.model.foregroundFetchNotificationPermissionInterval` and calls `pollingWaiter.wake()`, which dispatches a coroutine resume onto the IO pool. On cold start that hits the dispatcher / executor lazy chain inside OneSignalDispatchers and the construction cost is paid on the calling (main) thread. * FeatureFlagsRefreshService.onFocus / onUnfocused — calls `OneSignalDispatchers.launchOnIO` directly via `restartForegroundPolling`, same chain, same stall. Both move to `runOnSerialIOIfBackgroundThreading` (introduced earlier in this PR for NotificationsManager.onFocus). Identical rollout shape: * FF on -> SerialIO single-thread executor, off-main, ordered globally with BackgroundManager (SDK-4505) and NotificationsManager (this PR). * FF off -> inline on the lifecycle main thread (legacy behavior, retains the ANR for the control cohort). `FeatureFlagsRefreshService.onUnfocused` now qualifies the `synchronized` receiver as `synchronized(this@FeatureFlagsRefreshService)` so the lambda locks on the service instance — the same monitor `restartForegroundPolling` takes — rather than on the (no-receiver) lambda object. Tests: * :core FeatureFlagsRefreshServiceTests asserts onFocus / onUnfocused route through `runOnSerialIOIfBackgroundThreading` (3 invocations across start + direct onFocus + direct onUnfocused). * :notifications NotificationPermissionControllerTests asserts the polling lifecycle listener's onFocus / onUnfocused both dispatch through the helper. These two call sites originally landed in the SDK-4507 PR (#2645) but share the same root cause and rollout strategy as the NotificationsManager.onFocus offload, so consolidating them here keeps the FF rollout matrix one-knob simple and leaves SDK-4507 (#2645) to focus purely on the prewarm fix. :OneSignal:core + :OneSignal:notifications detekt + full unit suites green. Co-authored-by: Cursor <cursoragent@cursor.com>

…o avoid cold-start ANRs ANR-dump analysis (logs/2026-05-12, 500 entries, all on sdk_background_threading) attributes 47 / 500 ANRs (9.4%) to the SDK's own background-threading helpers stalling the main thread on cold start. The three call sites in the bucket are now all routed through runOnSerialIOIfBackgroundThreading by the SDK-4505 / SDK-4506 work, but the deeper root cause is that the very first IO consumer (whichever caller wins the race) pays the cost of constructing the entire OneSignalDispatchers lazy chain on its thread: ThreadPoolExecutor.execute -> LinkedBlockingQueue.offer CoroutineDispatcher.dispatch -> kotlinx.coroutines first-launch OneSignalDispatchers.IOScope.<init> (by lazy) OneSignalDispatchers.IO (by lazy) OneSignalDispatchers.ioExecutor (by lazy) That includes the kotlinx.coroutines MainDispatcherFactory ServiceLoader scan, executor + thread-factory construction, dispatcher wrapping, and SupervisorJob / CoroutineScope wiring. Under sdk_background_threading the first caller is typically an Activity-lifecycle handler or JobService.onStartJob - both on the main thread - and OTel records 5-20s blocks before the watchdog fires. OneSignalDispatchers.prewarm() spawns a dedicated short-lived "OneSignal-prewarm" daemon thread that submits one empty launch on each of IO / Default / SerialIO. That single thread pays the lazy-init cost end-to-end so the next production caller - even on the main thread - only sees the cheap "submit to already-constructed executor" path. * Idempotent: double-checked-locked prewarmStarted flag, so repeat calls from init / suspend init / JobService.onStartJob no-op cheaply. * Fire-and-forget: failures log and swallow; the existing Dispatchers.IO / SerialIO fallback paths in [IO] / [SerialIO] still apply if anything goes wrong, so a failed prewarm just means the first real caller pays the original cost. * Daemon thread at NORM_PRIORITY - 2 so prewarm never blocks process exit or starves UI work. Called from: * OneSignalImp.initWithContext(context, appId) (sync variant) * OneSignalImp.initWithContextSuspend(context, appId) (suspend variant, used by re-entrant suspend callers) * SyncJobService.onStartJob BEFORE suspendifyOnIO, because the JobService can fire before the host app's initWithContext runs. Tests (:core OneSignalDispatchersTests): * prewarm returns immediately on the caller and the daemon thread brings IO / Default / SerialIO + their scopes to Active. * prewarm is idempotent - second call does not spawn another OneSignal-prewarm thread (verified via thread-name scan). Scope reduction: the remaining-onFocus-handlers part of the original SDK-4507 change (NotificationPermissionController + FeatureFlagsRefreshService runOnSerialIOIfBackgroundThreading wrappers) was moved up the stack to SDK-4506 (#2644), where it sits next to NotificationsManager.onFocus since all three share the same FF-gated rollout shape. This PR is now focused purely on the prewarm fix. :OneSignal:core detekt + full unit suite green. Co-authored-by: Cursor <cursoragent@cursor.com>

…in thread ANR-dump analysis (logs/2026-05-12) attributes 25 of 500 ANRs (5.0%) to `SessionService.onFocus` blocking the main thread. Stack signature: java.util.concurrent.LinkedBlockingQueue.offer java.util.concurrent.ThreadPoolExecutor.execute kotlinx.coroutines.scheduling.CoroutineScheduler.dispatch com.onesignal.common.threading.OneSignalDispatchers.<lazy chain> ... operation-repo / IAM trigger eval handlers ... com.onesignal.session.internal.session.impl.b$c.invoke (sessionLifeCycleNotifier.fire) com.onesignal.session.internal.session.impl.b.onFocus (SessionService.onFocus) com.onesignal.core.internal.application.impl.a.handleFocus (fanout on main) SessionService.onFocus runs synchronously on the main thread via ApplicationService.handleFocus -> applicationLifecycleNotifier.fire. Inside, sessionLifeCycleNotifier.fire { onSessionStarted/Active } invokes the registered session-lifecycle handlers (operation repo, IAM trigger eval, etc.) synchronously; the first one to touch OneSignalDispatchers pays the cold-init cost on the main thread (5–20s blocks before the watchdog fires). This is the same SDK-4505 / SDK-4506 / SDK-4507 fanout pattern -- but on a pre-existing handler that predates background threading. Fix: wrap onFocus / onUnfocused bodies in runOnSerialIOIfBackgroundThreading (same gated helper SDK-4505 / SDK-4506 use). FF off retains the original inline semantics for the rollout control cohort and the existing synchronous-onFocus tests; FF on routes through the dedicated serial IO dispatcher so the fanout no longer runs on main. Timing semantics: * onFocus captures _time.currentTimeMillis on the caller's thread BEFORE the wrapper and passes it into the deferred handleOnFocus block, so session.startTime / session.focusTime still reflect when Android delivered the event, not when the serial dispatcher ran. * onUnfocused does the same so activeDuration accounting stays accurate (dt = unfocusTimeMs - session.focusTime, both captured on the main thread). Tests: * SessionServiceTests existing assertions about state mutation continue to pass (the FF defaults to off, so the wrapper runs inline; new private handleOnFocus / handleOnUnfocused preserve behavior). * Adds three new assertions: - onFocus dispatches through runOnSerialIOIfBackgroundThreading - onUnfocused dispatches through runOnSerialIOIfBackgroundThreading - rapid onUnfocused -> onFocus burst routes both events through the same gated helper in submission order (mirrors SDK-4505 BackgroundManager burst coverage; protects against activeDuration drift if the events ever raced across an unbounded IO pool). Linear: SDK-4508 Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-12T14:18:37Z

📊 Diff Coverage Report

Diff Coverage Report (Changed Lines Only)

Gate: aggregate coverage on changed executable lines must be ≥ 80% (JaCoCo line data for lines touched in the diff).

Changed Files Coverage

OneSignalDispatchers.kt: 21/23 touched executable lines (91.3%) (64 touched lines in diff)
SyncJobService.kt: 1/1 touched executable lines (100.0%) (10 touched lines in diff)
OneSignalImp.kt: 0/2 touched executable lines (0.0%) (14 touched lines in diff)
- 2 uncovered touched lines in this file
SessionService.kt: 13/13 touched executable lines (100.0%) (40 touched lines in diff)

Overall (aggregate gate)

35/39 touched executable lines covered (89.7% — requires ≥ 80%)

Per-file detail (informational; gate is aggregate above):

OneSignalImp.kt: 0.0% (2 uncovered touched lines)

📥 View workflow run

claude · 2026-05-12T14:30:09Z

            Logging.debug("SessionService: New session started at ${session.startTime}")
            sessionLifeCycleNotifier.fire { it.onSessionStarted() }
        } else {
            // existing session: just remember the focus time so we can calculate the active time
            // when onUnfocused is called.
-            session.focusTime = _time.currentTimeMillis
+            session.focusTime = focusTimeMs
            sessionLifeCycleNotifier.fire { it.onSessionActive() }
        }
    }

    override fun onUnfocused() {
+        // Same pattern as onFocus: capture the timing on the caller's thread so activeDuration
+        // accounting is unaffected by serial-dispatcher latency, then offload.
+        val unfocusTimeMs = _time.currentTimeMillis
+        runOnSerialIOIfBackgroundThreading {
+            handleOnUnfocused(unfocusTimeMs)
+        }
+    }
+
+    private fun handleOnUnfocused(unfocusTimeMs: Long) {
        val session = this.session
        if (session == null) {
            Logging.warn("SessionService.onUnfocused called before bootstrap; ignoring.")
            return
        }
        // capture the amount of time the app was focused


🔴 After offloading sessionLifeCycleNotifier.fire { onSessionStarted/onSessionActive } to serial IO, InfluenceManager.onSessionStarted (InfluenceManager.kt:57) and onSessionActive (line 61) re-read _applicationService.entryState at execution time rather than capture time. If the main thread mutates entryState (via handleFocus/handleLostFocus) between dispatch and serial-IO execution, notification influence attribution can be misclassified (e.g., a NOTIFICATION_CLICK session reset to UNATTRIBUTED under the APP_OPEN branch of getChannelsToResetByEntryAction). Fix: capture entryState alongside focusTimeMs on the caller thread and pass it through to handleOnFocus/handleOnUnfocused (or thread it via an explicit parameter on ISessionLifecycleHandler).

Extended reasoning...

What the bug is

The PR carefully captures focusTimeMs = _time.currentTimeMillis on the caller's thread before offloading the body of SessionService.onFocus/onUnfocused to runOnSerialIOIfBackgroundThreading, so session.startTime / session.focusTime reflect the moment Android delivered the lifecycle event. But _applicationService.entryState, which is just as time-sensitive, is not captured. It is read at execution time by subscribers inside the deferred fire:

InfluenceManager.onSessionStarted() → restartSessionTrackersIfNeeded(_applicationService.entryState) (InfluenceManager.kt:56-58)

InfluenceManager.onSessionActive() → attemptSessionUpgrade(_applicationService.entryState) (InfluenceManager.kt:60-62)

How it manifests — step-by-step proof

T0 (main): User clicks notification. NotificationLifecycleService.notificationOpened runs on main:

Sets _applicationService.entryState = AppEntryAction.NOTIFICATION_CLICK (NotificationLifecycleService.kt:177)

Calls _influenceManager.onDirectInfluenceFromNotification(notificationId) → attemptSessionUpgrade(NOTIFICATION_CLICK, notificationId) → notification tracker becomes DIRECT.

T0+ε (main): Android fires Activity lifecycle. ApplicationService.handleFocus runs on main, the if (entryState != NOTIFICATION_CLICK) guard at line 391 preserves entryState = NOTIFICATION_CLICK, then applicationLifecycleNotifier.fire { onFocus(false) }.

T0+ε' (main): SessionService.onFocus(false) captures focusTimeMs and submits handleOnFocus(false, focusTimeMs) to the serial IO executor.

T0+δ (main): Before serial IO runs the submitted task, the user briefly backgrounds the activity. ApplicationService.handleLostFocus on main sets entryState = APP_CLOSE (line 375). User refocuses; handleFocus sees entryState == APP_CLOSE != NOTIFICATION_CLICK and overwrites it with APP_OPEN (line 392).

T0+Δ (serial IO): Step 3's submitted block executes. session.isValid == false (cold start), so the !session.isValid branch fires sessionLifeCycleNotifier.fire { it.onSessionStarted() }.

InfluenceManager.onSessionStarted reads _applicationService.entryState → now APP_OPEN, not the NOTIFICATION_CLICK that triggered step 3.

restartSessionTrackersIfNeeded(APP_OPEN) → getChannelsToResetByEntryAction(APP_OPEN) takes a different branch than NOTIFICATION_CLICK would (InfluenceManager.kt:71-84):

APP_OPEN: adds notificationChannelTracker (because entryAction.isAppOpen) and iAMChannelTracker → both get reset.

NOTIFICATION_CLICK: isAppOpen is false, so only iAMChannelTracker is added — the notification tracker is deliberately left alone to preserve direct attribution.

Because the wrong branch ran, the notification tracker's DIRECT influence (set in step 1) is overwritten via setSessionTracker (DIRECT → INDIRECT or UNATTRIBUTED, depending on lastReceivedIds). willChangeSessionTracker returns true because the new influenceType differs from the existing DIRECT, so the change is committed and cacheState() persists the misclassification.

Why pre-PR didn't have this race

Pre-PR, sessionLifeCycleNotifier.fire ran synchronously on the main thread inside SessionService.onFocus. The block executed on the same thread that just set entryState (handleFocus runs the lifecycle fanout synchronously after setting entryState), and because it occupied the main thread, no subsequent main-thread Activity lifecycle callbacks could fire and mutate entryState during execution. The very main-thread block that this PR is fixing (the 5-20s cold-init stall) had the side effect of protecting these reads.

Why the refutation's mitigations don't hold

The refutation argues that direct attribution is set out-of-band via onDirectInfluenceFromNotification, so even if the deferred onSessionStarted reads the wrong entryState, attribution survives. This is wrong: restartSessionTrackersIfNeeded(APP_OPEN) explicitly resets the notification tracker (lines 76-78 add it because entryAction.isAppOpen), and willChangeSessionTracker will return true when transitioning DIRECT → INDIRECT/UNATTRIBUTED, so the DIRECT attribution set by the direct call path is actively overwritten. NOTIFICATION_CLICK protects the tracker by leaving notificationChannel = null (it is not in the reset list at all).

The refutation also calls the race "microseconds-narrow," but that contradicts the PR's own motivation: the entire reason for offloading is that the cold-init dispatcher chain takes 5-20s on the main thread. Even after prewarm() is added, the race window is at minimum the serial-IO queue depth plus dispatcher overhead; under load this can be hundreds of ms, more than enough for a user-paced focus/unfocus sequence (e.g. notification-shade dismiss, biometric prompt return, etc.) to mutate entryState.

How to fix

Capture entryState on the caller thread alongside focusTimeMs and thread it through to the deferred handler:

override fun onFocus(firedOnSubscribe: Boolean) { val focusTimeMs = _time.currentTimeMillis val entryStateSnapshot = _applicationService.entryState runOnSerialIOIfBackgroundThreading { handleOnFocus(firedOnSubscribe, focusTimeMs, entryStateSnapshot) } }

Then either (a) pass entryStateSnapshot as a parameter on ISessionLifecycleHandler.onSessionStarted/onSessionActive (cleaner; explicit), or (b) snapshot it into a thread-local-equivalent state SessionService owns and have InfluenceManager read it via a getter parameter. Option (a) parallels how focusTimeMs is already threaded through and makes the timing invariant explicit at the contract level.

claude · 2026-05-12T14:30:09Z

+        }
+    }
+
+    private fun handleOnFocus(
+        firedOnSubscribe: Boolean,
+        focusTimeMs: Long,
+    ) {
        Logging.log(LogLevel.DEBUG, "SessionService.onFocus() - fired from start: $firedOnSubscribe")



🔴 With SDK_BACKGROUND_THREADING on, SessionService.onFocus now defers handleOnFocus to the serial IO thread while StartupService continues iterating IStartableService.start() on the Default thread — and several of those start() methods (SessionListener, InfluenceManager, SubscriptionManager, InAppMessagesManager, UserRefreshService) call sessionService.subscribe(this). This introduces a cross-thread race where a subscriber added before fire snapshots subscribers can still observe shouldFireOnSubscribe = true afterwards, causing onSessionStarted() to fire twice — duplicate TrackSessionStartOperation enqueues, duplicate UserRefreshService.refreshUser() network calls, duplicate influence/IAM trigger evaluations. Pre-PR this path was deterministic because both handleOnFocus and the subsequent subscribe calls ran on the same startables thread.

Extended reasoning...

Threading topology (post-PR, FF on).

StartupService.scheduleStart() iterates IStartableService.start() sequentially on the Default dispatcher (StartupService.kt:28-37). SessionModule.kt:39-44 registers SessionService before SessionListener, so the loop reaches SessionService.start() first, which calls _applicationService.addApplicationLifecycleHandler(this). ApplicationService.addApplicationLifecycleHandler synchronously invokes handler.onFocus(true) when current != null (ApplicationService.kt:121-128).

Pre-PR, onFocus(true) ran handleOnFocus inline on the Default thread, so the entire body — shouldFireOnSubscribe = true, then sessionLifeCycleNotifier.fire { onSessionStarted() } — completed before SessionService.start() returned. The next startable's subscribe() always observed shouldFireOnSubscribe = true AFTER the fire (with no subscribers yet), giving exactly one onSessionStarted() per handler.

Post-PR, onFocus captures focusTimeMs then does runOnSerialIOIfBackgroundThreading { handleOnFocus(...) } and returns. The Default thread continues to SessionListener.start() → _sessionService.subscribe(this) (and so on for SubscriptionManager, InfluenceManager, UserRefreshService, InAppMessagesManager).

The double-fire interleaving.

SessionService.subscribe(handler):
A1. sessionLifeCycleNotifier.subscribe(handler) → synchronized(subscribers) { add(handler) }
A2. if (shouldFireOnSubscribe) handler.onSessionStarted()

handleOnFocus (in the !session.isValid branch, which is the cold-start path because endSession() invalidates the prior session):
B1. shouldFireOnSubscribe = firedOnSubscribe (true)
B2. sessionLifeCycleNotifier.fire { ... } → synchronized(subscribers) { subscribers.toList() } then for (s in localList) callback(s)

Walk through interleaving A1 → B1 → B2 → A2:

Default thread acquires subscribers monitor, adds the handler, releases. A1's monitor exit synchronizes-with any subsequent acquire of the same monitor.

SerialIO thread sets shouldFireOnSubscribe = true.

SerialIO thread acquires subscribers monitor (after A1's release, so A1 HB B2). Snapshot includes the handler. Releases monitor. Iterates → handler.onSessionStarted() fires (FIRE gradle? #1).

Default thread reads shouldFireOnSubscribe. With B1 HB B2 (program order) and A1 HB B2 (monitor sync), the read at A2 happens after A1. On real Android ART with cache-coherent ARM cores and the intervening monitor ops, the read commonly observes the new true value → handler.onSessionStarted() fires again (FIRE Added possibility to send status 'opened' for a message back to onesignal backend from client source code #2).

Observable impact. SessionListener.onSessionStarted() (SessionListener.kt:45-55) enqueues a TrackSessionStartOperation; double-fire produces duplicate session-start operations against the user, observable as inflated session counts in analytics. UserRefreshService.onSessionStarted() = refreshUser() (a network call) double-fires. InfluenceManager and InAppMessagesManager re-run session-start side effects. The window is small (the time between the SerialIO worker dequeuing the task and the Default thread's next instruction after subscribe() returns), but real and gated on the exact FF this PR targets.

Why the refutation's "narrow theoretical window" argument doesn't hold. The refutation argues that on cold start the SerialIO dispatcher is slow to spin up, so by the time handleOnFocus runs, all subscribe() calls have completed (no double fire). But the same PR series adds OneSignalDispatchers.prewarm() (OneSignalDispatchers.kt:186-238), called from initWithContext before bootstrapServices() and scheduleStart(). prewarm submits empty coroutines to all three dispatchers — including SerialIO — so the SerialIO worker thread is alive and idle by the time scheduleStart() reaches SessionService.start(). The "cold dispatcher absorbs the race window" assumption is invalidated by prewarm.

Re: missed-fire (Race #2). Acknowledging the refutation's and the third verifier's correct point: the alternative interleaving where subscribe.add happens after fire's snapshot establishes a monitor synchronizes-with edge from fire's monitor exit to subscribe's monitor acquire, which gives B1 HB A2 transitively. So A2 always sees the new true value, and there is exactly one fire (via subscribe). A missed fire is not possible. The bug here is strictly the double-fire scenario; the synthesis description's mention of "missed fire" should be disregarded.

Fix. Restore atomicity between the subscriber-registration check and the fire by either: (a) inside subscribe, take the same monitor used by fire while checking shouldFireOnSubscribe, plus mark shouldFireOnSubscribe @Volatile and read it AFTER releasing the subscribers monitor only if the snapshot was taken before B2; or simpler — (b) run the body inline on the caller thread when firedOnSubscribe == true (the synchronous addApplicationLifecycleHandler path), and only defer the natural lifecycle-event onFocus to serial IO; or (c) have fire and subscribe use the same lock so that "subscribed before fire snapshotted" deterministically implies "did not see shouldFireOnSubscribe = true".

claude · 2026-05-12T14:30:09Z

    override fun onStartJob(jobParameters: JobParameters): Boolean {
+        // Android delivers JobService.onStartJob on the main thread. The suspendifyOnIO call
+        // below is the SDK's first IO-pool consumer on cold start in this process, and the
+        // executor + dispatcher + coroutine-scope lazy chain it triggers was producing multi-
+        // second main-thread blocks in production (SDK-4507). prewarm() shifts that cost to a
+        // short-lived daemon thread; idempotent, so it's harmless when initWithContext already
+        // ran. Must happen BEFORE suspendifyOnIO; calling it inside the dispatched block would
+        // be too late because the cold-init cost has already been paid on entry to the helper.
+        OneSignalDispatchers.prewarm()
+


🟡 The OneSignalDispatchers.prewarm() call in SyncJobService.onStartJob is unlikely to deliver the protection the new comment claims (lines 38-46): because nothing sits between prewarm() and suspendifyOnIO, and Kotlin's by lazy defaults to LazyThreadSafetyMode.SYNCHRONIZED, the main thread typically wins the race to the lazy chain (and pays the init cost) or — if the prewarm daemon wins — blocks on the synchronized monitor for the full init duration. Android's ANR detector treats a monitor-blocked main thread the same as one running slow code, so the cold-start SDK-4507 block this call site specifically targets is not actually prevented here. Consider either inlining the prewarm body into the start of the suspendifyOnIO block (so it always runs off-main), removing the call (the rest of the PR is the real fix), or at minimum updating the comment to reflect the probabilistic semantics.

Extended reasoning...

What the bug is

SyncJobService.onStartJob is delivered on the main thread. The new code is:

override fun onStartJob(jobParameters: JobParameters): Boolean { OneSignalDispatchers.prewarm() // 1 suspendifyOnIO { ... } // 2 return true }

The accompanying comment asserts: "prewarm() shifts that cost to a short-lived daemon thread." But in this specific call site there is zero synchronous work between (1) and (2), which is exactly what the prewarm optimization needs to be effective. Contrast with OneSignalImp.initWithContext, where prewarm() is followed by synchronized(initLock) { ... }, ensureApplicationServiceStarted(), etc. — substantial wall-clock work that gives the daemon time to win the race.

Why the optimization is ineffective here

All the relevant properties in OneSignalDispatchers (IO, Default, SerialIO, IOScope, DefaultScope, SerialIOScope, ioExecutor, defaultExecutor, serialIOExecutor) are declared by lazy { ... } with no explicit thread-safety mode, which means Kotlin's default LazyThreadSafetyMode.SYNCHRONIZED applies: whichever thread enters the initializer first owns a synchronized monitor; every other thread blocks on that monitor until init completes.

The race timeline on a cold process (the documented SDK-4507 scenario where JobScheduler resurrects a process without initWithContext having run):

Main: prewarm() → Thread.start() returns immediately. The daemon is in NEW/RUNNABLE, awaiting OS scheduling.

Main: suspendifyOnIO { ... } → suspendifyWithCompletion → if (ThreadingMode.useBackgroundThreading) OneSignalDispatchers.launchOnIO { ... } → accesses IOScope (lazy) → IO (lazy) → ioExecutor (lazy).

The wall-clock gap between Thread.start() returning and the main thread reaching the lazy chain is on the order of nanoseconds (a few method dispatches). The daemon thread needs to be scheduled by the OS and then execute its own launchOnIO to reach the same lazy chain.

Further compounding the race: the prewarm thread is created with priority = Thread.NORM_PRIORITY - 2 (see OneSignalDispatchers.kt:236), which the OS scheduler treats as a hint to favor the main thread.

The "either outcome is bad" key observation

Regardless of who wins the race, the main thread blocks for ~init_time:

Main wins → main runs the slow initializer inline.

Daemon wins → main reaches the lazy access a few microseconds later and blocks on the synchronized monitor until the daemon finishes the 5–20 s init. Android's ANR detector treats a WAITING/monitor-blocked main thread the same as one running slow code.

Step-by-step proof on a cold process

Process starts because JobScheduler is firing SyncJobService. OneSignalImp.initWithContext has not run; prewarmStarted == false.

Main enters onStartJob. Calls prewarm(): it sets prewarmStarted = true, constructs Thread("OneSignal-prewarm", priority=NORM_PRIORITY-2), calls .start(), returns. Daemon is RUNNABLE but not yet on a CPU.

Main proceeds to suspendifyOnIO { ... }. Within a few ns: enters suspendifyWithCompletion, checks ThreadingMode.useBackgroundThreading (the SDK-4507 cohort = true), calls OneSignalDispatchers.launchOnIO. That function body is return IOScope.launch { block() } — touching IOScope triggers by lazy { CoroutineScope(SupervisorJob() + IO) }, which in turn touches IO, which in turn touches ioExecutor. The first of these synchronized monitors that the daemon hasn't already entered, main enters.

Case A — main wins the monitor: main constructs ioExecutor (ThreadPoolExecutor, allocating its internal LinkedBlockingQueue, spawning core threads, etc.), wraps it in asCoroutineDispatcher(), constructs IOScope, then calls .launch { block() } which hits ThreadPoolExecutor.execute → LinkedBlockingQueue.offer — exactly the stack the PR description cites. Main is blocked for the full init duration.

Case B — daemon wins the monitor (less common because of priority + the head-start gap): daemon runs the same construction. Main reaches the lazy access ~μs later, finds the monitor held, blocks in WAITING until daemon releases. Same wall-clock cost on main.

What about the refutation?

The refutation argues this is not actionable because (a) the common case is idempotent (already-initialized SDK → prewarmStarted == true → no-op, lazy chain already warm), (b) the code is not worse than pre-PR, and (c) the bug is really a critique of the comment. Those points are fair:

The code is correct: no crash, no incorrect behavior.

It does not regress anything: pre-PR, the main thread blocked 100% of the time on cold start; post-PR, it still blocks (deterministically when main wins, monitor-blocked when daemon wins).

But the reason this is still worth surfacing rather than abstaining: the PR description and the new comment at SyncJobService.kt:38-46 both frame this prewarm() call as a real piece of the SDK-4507 fix, and the JobService cold-start scenario is precisely the one cited in the description ("Android cold-starts a process just to run SyncJobService"). A future maintainer reading the comment will reasonably believe the SyncJobService path is covered, when in practice the cold-process block remains. The real protection in this PR comes from runOnSerialIOIfBackgroundThreading on the lifecycle handlers, not from this prewarm() call.

How to fix

Options (any one is sufficient):

Inline the warming into the dispatched block: move all of onStartJob's body inside the suspendifyOnIO (already does that today) — but skip prewarm() entirely here, since by the time suspendifyOnIO runs the cold-init cost has been paid on the IO worker thread anyway. The current prewarm() is dead weight in the cold-process case.

Use a non-ANR-sensitive thread for the JobService entry: spawn a worker thread (or use runOnSerialIOIfBackgroundThreading) for the entire onStartJob body, so the cold-init cost — whoever pays it — never lands on main.

At minimum, update the comment to acknowledge the race semantics: prewarm is only effective when meaningful synchronous work follows it on the caller's thread, which is not the case here. This keeps a future maintainer from assuming SyncJobService is covered when it isn't.

The third option is the lightest touch and removes the misleading load-bearing claim without changing behavior.

abdulraqeeb33 · 2026-05-12T14:33:52Z

Closing — the SessionService.onFocus / onUnfocused offload from this PR has been consolidated into #2644 (the unified onFocus / onUnfocused offload PR). All five lifecycle handlers (BackgroundManager, NotificationsManager, NotificationPermissionController, FeatureFlagsRefreshService, SessionService) now move together on the same FF in #2644.

AR Abdul Azeez and others added 10 commits May 11, 2026 13:20

abdulraqeeb33 force-pushed the ar/sdk-4508 branch from c91e0ab to 0193fee Compare May 12, 2026 14:18

claude Bot reviewed May 12, 2026

View reviewed changes

abdulraqeeb33 force-pushed the ar/sdk-4506-notifications-onfocus-anr branch from 520ae2c to 0621414 Compare May 12, 2026 14:32

abdulraqeeb33 closed this May 12, 2026

abdulraqeeb33 deleted the ar/sdk-4508 branch May 12, 2026 14:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: SDK-4508 offload SessionService.onFocus / onUnfocused off the main thread#2646

fix: SDK-4508 offload SessionService.onFocus / onUnfocused off the main thread#2646
abdulraqeeb33 wants to merge 10 commits into
ar/sdk-4506-notifications-onfocus-anrfrom
ar/sdk-4508

abdulraqeeb33 commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 •

edited

Loading

Uh oh!

claude Bot May 12, 2026

Uh oh!

claude Bot May 12, 2026

Uh oh!

claude Bot May 12, 2026

Uh oh!

abdulraqeeb33 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abdulraqeeb33 commented May 12, 2026

Summary

Fix

Timing semantics preserved

Base branch

Test plan

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Diff Coverage Report

Diff Coverage Report (Changed Lines Only)

Changed Files Coverage

Overall (aggregate gate)

Uh oh!

claude Bot May 12, 2026

Choose a reason for hiding this comment

What the bug is

How it manifests — step-by-step proof

Why pre-PR didn't have this race

Why the refutation's mitigations don't hold

How to fix

Uh oh!

claude Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 12, 2026

Choose a reason for hiding this comment

What the bug is

Why the optimization is ineffective here

The "either outcome is bad" key observation

Step-by-step proof on a cold process

What about the refutation?

How to fix

Uh oh!

abdulraqeeb33 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 12, 2026 •

edited

Loading