fix: SDK-4508 offload SessionService.onFocus / onUnfocused off the main thread#2646
fix: SDK-4508 offload SessionService.onFocus / onUnfocused off the main thread#2646abdulraqeeb33 wants to merge 10 commits into
Conversation
…off the main thread ApplicationService.handleFocus dispatches onFocus/onUnfocused on the main thread (via Activity.onStart -> dispatchActivityStarted). BackgroundManager was previously calling JobScheduler.cancel/schedule synchronously from those callbacks. JobScheduler operations are synchronous Binder transactions that can stall for many seconds on some devices (notably Xiaomi/MIUI under power-save), producing main-thread ANRs. Reported example: 20,796 ms main-thread block on a Xiaomi 25078RA3EL / Android 15 with sample insertId ycae33cjpu6gcyut on SDK v5.9.0. Route cancelSyncTask() and scheduleBackground() through suspendifyOnIO so the actual JobScheduler call happens on OneSignalDispatchers.IO (or Dispatchers.IO when the BACKGROUND_THREADING flag is off). State mutations inside cancelSyncTask/scheduleSyncTask are already guarded by synchronized(lock); JobScheduler.cancel is idempotent and scheduleSyncServiceAsJob is de-duped by isJobIdRunning(), so running on a multi-threaded pool is safe. Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the regression test promised in the PR TODO. Verifies that BackgroundManager.onFocus() and onUnfocused() route through OneSignalDispatchers.launchOnIO (i.e. via suspendifyOnIO) instead of running JobScheduler operations inline on the calling thread, which was the source of the 20s main-thread ANR reported in SDK-4505. Mirrors the existing pattern in ThreadUtilsFeatureFlagTests (mockkObject(OneSignalDispatchers) + verify launchOnIO is invoked exactly once per lifecycle callback) with BACKGROUND_THREADING toggled on, since that is the configuration the production ANR was reported under. Brings diff coverage on the touched executable lines from 0% to 100%, satisfying the PR diff-coverage gate. Co-authored-by: Cursor <cursoragent@cursor.com>
…patcher
The initial SDK-4505 fix routed BackgroundManager.onFocus/onUnfocused through
suspendifyOnIO, which dispatches onto the multi-threaded OneSignal IO pool
(core 2 / max 3). That solved the main-thread ANR but left a subtler race for
rapid lifecycle bursts: an `unfocused -> focused` pair submitted in close
succession can be picked up by two IO worker threads and executed in the
opposite order, leaving a SyncJob scheduled while the app is foregrounded
(or wiping a freshly scheduled job when the user has just backgrounded).
Each individual operation is internally safe (synchronized(lock), the
nextScheduledSyncTimeMs dedupe, and the isJobIdRunning short-circuit) but
last-writer-wins across the pool is not preserved, and the problem will
worsen as these handlers grow per-event work (session timing, focus
counters, analytics) that can't be reconciled from a single state snapshot.
Introduces a dedicated single-thread dispatcher for order-sensitive work:
- OneSignalDispatchers.SerialIO: a daemon Executors.newSingleThreadExecutor
named "OneSignal-SerialIO", with a Dispatchers.IO.limitedParallelism(1)
fallback if our executor fails to start. Submission order on the caller
thread equals execution order on this thread, so callers no longer need
their own synchronization to preserve sequencing.
- launchOnSerialIO(block) on OneSignalDispatchers.
- suspendifyOnSerialIO(block) helper in ThreadUtils, mirroring the existing
suspendifyOnIO shape. Always routes through OneSignalDispatchers
regardless of ThreadingMode.useBackgroundThreading, since the ordering
guarantee is the whole point of this entry point and the single thread
carries none of the resource concerns the BACKGROUND_THREADING FF gates.
BackgroundManager.onFocus/onUnfocused now use suspendifyOnSerialIO. Refactor
getStatus() into small executorStatus / scopeStatus helpers to extend the
diagnostic output to the new dispatcher without tripping detekt's LongMethod
/ ComplexMethod / StringLiteralDuplication checks (the refactor incidentally
brings the function well under those thresholds).
IOMockHelper stubs suspendifyOnSerialIO and launchOnSerialIO alongside the
existing entry points, so any spec using the helper transparently waits on
serial work via awaitIO().
BackgroundManagerTests now verifies launchOnSerialIO is the dispatcher
target, and adds a `rapid unfocus -> focus burst` case that uses verifyOrder
to assert both events are submitted to the serial dispatcher in
main-thread submission order.
Co-authored-by: Cursor <cursoragent@cursor.com>
…ing FF
Wraps the suspendifyOnSerialIO offload added in the prior two commits in
`if (ThreadingMode.useBackgroundThreading) { ... } else { ... }` so the
fix can ship as an A/B against the legacy inline path rather than as an
unconditional change:
FF on -> suspendifyOnSerialIO { cancelSyncTask() / scheduleBackground() }
Routes through OneSignalDispatchers.SerialIO, off the main
thread, with submission order preserved.
FF off -> cancelSyncTask() / scheduleBackground()
Legacy inline path; retains the SDK-4505 ANR for the control
cohort so we can measure the fix's impact.
The existing `sdk_background_threading` FF is APP_STARTUP-activated (see
FeatureFlag.kt) so a given session is latched on one path and won't
bounce mid-run, which keeps the comparison clean. The production ANR
sample was on FF=ON before this PR (because the legacy code called
JobScheduler.cancel directly, bypassing the threading helpers), so this
gate is repurposing the flag to also control "does BackgroundManager
offload its lifecycle calls" — once the FF=ON cohort confirms the ANR
rate drop, we'll flip the FF globally or remove the gate.
BackgroundManagerTests now covers both arms:
- 3 existing cases run with useBackgroundThreading=true and assert
OneSignalDispatchers.launchOnSerialIO is invoked (the offload + the
rapid-burst ordering case).
- 2 new cases run with useBackgroundThreading=false and assert
launchOnSerialIO is NOT invoked, i.e. onFocus/onUnfocused execute on
the calling thread. A small applicationServiceWithStubbedJobScheduler
helper stubs IApplicationService.appContext.getSystemService so the
inline path can complete without a Robolectric Context.
Co-authored-by: Cursor <cursoragent@cursor.com>
…etStatus helpers
Brings diff coverage above the 80% gate (50.0% -> 92.5%) by exercising
the new SerialIO plumbing introduced earlier in the PR.
OneSignalDispatchersTests:
- SerialIO dispatcher executes work on a background thread.
- launchOnSerialIO runs N queued tasks on a single thread in
submission order (the SDK-4505 ordering contract).
- An exception in one SerialIO task does not poison the queue for
subsequent tasks.
- getStatus() reports the new SerialIO Executor / Scope lines.
- getPerformanceMetrics() reports the new SerialIO Queue line and
folds serialCompleted into the total.
- executorStatus / scopeStatus return "Active" / "Shutdown" /
"Cancelled" on the happy paths and the "<name> Not initialized
<message-or-Unknown error>" string when the underlying
isShutdown / isActive lambda throws.
ThreadUtilsFeatureFlagTests:
- suspendifyOnSerialIO always routes through
OneSignalDispatchers.launchOnSerialIO regardless of
ThreadingMode.useBackgroundThreading. The serial ordering
guarantee is the whole point of this entry point, so we want a
regression test that pins down that contract.
- suspendifyOnSerialIO swallows exceptions thrown inside the block
(matches the production try/catch behavior).
To make the defensive catch branches in executorStatus / scopeStatus
reachable from unit tests, they are promoted from `private inline` to
`internal`. The catch only fires when the underlying lazy executor /
scope is in a failed state (re-throws on every isShutdown / isActive
access), which is not triggerable from a unit test without forcing
Executors.newSingleThreadExecutor to fail to construct. Exposing the
helpers and passing throwing lambdas in the test is the cheapest way
to cover that branch without simulating a JVM-level failure.
The remaining 4 uncovered lines on OneSignalDispatchers.kt are the
catch blocks of the lazy executor / dispatcher / getPerformanceMetrics
initializers themselves, which are genuinely unreachable from a unit
test (require Executors / asCoroutineDispatcher / the entire
performance-metrics try{} body to throw) - those are kept as defensive
programming.
Co-authored-by: Cursor <cursoragent@cursor.com>
…ated lifecycle offload
Introduces a small ThreadUtils helper that conditionally dispatches a
non-suspending block to OneSignalDispatchers.SerialIO when the
SDK_BACKGROUND_THREADING remote feature flag is on, and runs it inline
on the caller thread when the flag is off.
FF on -> suspendifyOnSerialIO { block() } // off-main, single-thread, ordered
FF off -> block() // legacy inline path
This is the same gating pattern SDK-4505 applied inline at the
BackgroundManager call sites, lifted into a reusable helper so other
order-sensitive lifecycle handlers (NotificationsManager.onFocus in
the follow-up commit; SessionService and friends later) can opt into
it with one wrapper call instead of repeating the if/else everywhere.
Differs from the existing entry points by design:
- suspendifyOnIO -> always offloads, multi-thread IO pool (no
ordering guarantee).
- suspendifyOnSerialIO -> always offloads, single-thread serial
(ordering guarantee; FF-agnostic on purpose).
- runOnSerialIOIfBackgroundThreading -> FF-gated. Used when we want an
A/B comparison of the new behavior against
the previous inline behavior in production.
The block is non-suspending because the FF-off branch executes on the
caller's thread - a suspending block there would force a runBlocking,
which defeats the purpose of the comparison.
IOMockHelper stubs the helper to run inline on the test thread by
default so existing specs keep their observable behavior; tests that
want to exercise the FF-on (offload) branch can override the stub.
Regression tests on ThreadUtilsFeatureFlagTests cover both branches:
- FF on -> launchOnSerialIO invoked, block not run inline
- FF off -> launchOnSerialIO not invoked, block runs on caller thread
Co-authored-by: Cursor <cursoragent@cursor.com>
…hind sdk_background_threading FF
Production OTel sample insertId 9qy5s0ta0cwqwmb0 shows a 30.5 s
main-thread ANR on vivo I2306 / Android 15:
Application Not Responding: Main thread blocked for 30516ms
at SQLiteConnection.nativeExecuteForLong
... (WorkDatabase_Impl / WorkManagerImpl.enqueueUniqueWork)
at NotificationRestoreWorkManager.beginEnqueueingWork
at NotificationsManager.refreshNotificationState
at NotificationsManager.onFocus
at ApplicationService.handleFocus
at ApplicationService.onActivityStarted
at Application.dispatchActivityStarted
at Activity.onStart
Same Activity-lifecycle fan-out as SDK-4505: Activity.onStart synchronously
fires every IApplicationLifecycleHandler.onFocus on the main thread.
NotificationsManager.onFocus invokes refreshNotificationState() which
calls NotificationRestoreWorkManager.beginEnqueueingWork(), which:
1. Lazily constructs WorkManager (opens / migrates the SQLite store on
first call).
2. Synchronously writes to the WorkManager DB via enqueueUniqueWork.
Both are SQLite disk I/O on the main thread. Under storage contention,
low memory, or OEM background daemons holding the fs lock, the call can
stall for 10s of seconds and trip an ANR.
This is the first foreground event after process start - the restored
flag inside beginEnqueueingWork short-circuits on subsequent calls, so
only the first onFocus eats the stall. (Subsequent rapid focus events
still benefit from the helper because they go through it consistently;
the stall just doesn't repeat per session.)
Fix: route refreshNotificationState() through the new
runOnSerialIOIfBackgroundThreading helper (added in the preceding
commit). On the SDK_BACKGROUND_THREADING FF=on cohort, the work moves
to OneSignalDispatchers.SerialIO - the same single-thread executor
BackgroundManager already uses, so onFocus work across both handlers
stays globally ordered (no risk of a stale NotificationsManager refresh
interleaving with a fresh BackgroundManager schedule/cancel pair). On
the FF=off cohort the inline behavior is retained, mirroring the
SDK-4505 rollout strategy and giving us the A/B signal we need to
confirm ANR rate drops on the FF=on cohort.
Regression tests on NotificationsManagerTests assert:
- onFocus dispatches refreshNotificationState through the gated helper
(the contract that the call site stays off-main on FF=on).
- A rapid onFocus burst routes both events through the gated helper
in submission order (defense against future per-event work being
added here without re-introducing the reorder hazard that the
SerialIO dispatcher prevents).
Co-authored-by: Cursor <cursoragent@cursor.com>
…ionController, FeatureFlagsRefreshService) behind sdk_background_threading FF
Expand SDK-4506's scope to cover the other two lifecycle handlers that the
ANR-dump analysis from logs/2026-05-12 surfaced as also stalling the main
thread on cold start under `sdk_background_threading`:
* NotificationPermissionController polling lifecycle listener — reads
`_configModelStore.model.foregroundFetchNotificationPermissionInterval`
and calls `pollingWaiter.wake()`, which dispatches a coroutine resume
onto the IO pool. On cold start that hits the dispatcher / executor
lazy chain inside OneSignalDispatchers and the construction cost is
paid on the calling (main) thread.
* FeatureFlagsRefreshService.onFocus / onUnfocused — calls
`OneSignalDispatchers.launchOnIO` directly via `restartForegroundPolling`,
same chain, same stall.
Both move to `runOnSerialIOIfBackgroundThreading` (introduced earlier in
this PR for NotificationsManager.onFocus). Identical rollout shape:
* FF on -> SerialIO single-thread executor, off-main, ordered globally
with BackgroundManager (SDK-4505) and NotificationsManager
(this PR).
* FF off -> inline on the lifecycle main thread (legacy behavior, retains
the ANR for the control cohort).
`FeatureFlagsRefreshService.onUnfocused` now qualifies the `synchronized`
receiver as `synchronized(this@FeatureFlagsRefreshService)` so the lambda
locks on the service instance — the same monitor `restartForegroundPolling`
takes — rather than on the (no-receiver) lambda object.
Tests:
* :core FeatureFlagsRefreshServiceTests asserts onFocus / onUnfocused
route through `runOnSerialIOIfBackgroundThreading` (3 invocations
across start + direct onFocus + direct onUnfocused).
* :notifications NotificationPermissionControllerTests asserts the
polling lifecycle listener's onFocus / onUnfocused both dispatch
through the helper.
These two call sites originally landed in the SDK-4507 PR (#2645) but
share the same root cause and rollout strategy as the
NotificationsManager.onFocus offload, so consolidating them here keeps
the FF rollout matrix one-knob simple and leaves SDK-4507 (#2645) to
focus purely on the prewarm fix.
:OneSignal:core + :OneSignal:notifications detekt + full unit suites
green.
Co-authored-by: Cursor <cursoragent@cursor.com>
…o avoid cold-start ANRs
ANR-dump analysis (logs/2026-05-12, 500 entries, all on
sdk_background_threading) attributes 47 / 500 ANRs (9.4%) to the SDK's
own background-threading helpers stalling the main thread on cold start.
The three call sites in the bucket are now all routed through
runOnSerialIOIfBackgroundThreading by the SDK-4505 / SDK-4506 work, but
the deeper root cause is that the very first IO consumer (whichever
caller wins the race) pays the cost of constructing the entire
OneSignalDispatchers lazy chain on its thread:
ThreadPoolExecutor.execute -> LinkedBlockingQueue.offer
CoroutineDispatcher.dispatch -> kotlinx.coroutines first-launch
OneSignalDispatchers.IOScope.<init> (by lazy)
OneSignalDispatchers.IO (by lazy)
OneSignalDispatchers.ioExecutor (by lazy)
That includes the kotlinx.coroutines MainDispatcherFactory ServiceLoader
scan, executor + thread-factory construction, dispatcher wrapping, and
SupervisorJob / CoroutineScope wiring. Under sdk_background_threading
the first caller is typically an Activity-lifecycle handler or
JobService.onStartJob - both on the main thread - and OTel records
5-20s blocks before the watchdog fires.
OneSignalDispatchers.prewarm() spawns a dedicated short-lived
"OneSignal-prewarm" daemon thread that submits one empty launch on each
of IO / Default / SerialIO. That single thread pays the lazy-init cost
end-to-end so the next production caller - even on the main thread -
only sees the cheap "submit to already-constructed executor" path.
* Idempotent: double-checked-locked prewarmStarted flag, so repeat
calls from init / suspend init / JobService.onStartJob no-op cheaply.
* Fire-and-forget: failures log and swallow; the existing
Dispatchers.IO / SerialIO fallback paths in [IO] / [SerialIO] still
apply if anything goes wrong, so a failed prewarm just means the
first real caller pays the original cost.
* Daemon thread at NORM_PRIORITY - 2 so prewarm never blocks process
exit or starves UI work.
Called from:
* OneSignalImp.initWithContext(context, appId) (sync variant)
* OneSignalImp.initWithContextSuspend(context, appId) (suspend variant,
used by re-entrant suspend callers)
* SyncJobService.onStartJob BEFORE suspendifyOnIO, because the
JobService can fire before the host app's initWithContext runs.
Tests (:core OneSignalDispatchersTests):
* prewarm returns immediately on the caller and the daemon thread
brings IO / Default / SerialIO + their scopes to Active.
* prewarm is idempotent - second call does not spawn another
OneSignal-prewarm thread (verified via thread-name scan).
Scope reduction: the remaining-onFocus-handlers part of the original
SDK-4507 change (NotificationPermissionController + FeatureFlagsRefreshService
runOnSerialIOIfBackgroundThreading wrappers) was moved up the stack to
SDK-4506 (#2644), where it sits next to NotificationsManager.onFocus
since all three share the same FF-gated rollout shape. This PR is now
focused purely on the prewarm fix.
:OneSignal:core detekt + full unit suite green.
Co-authored-by: Cursor <cursoragent@cursor.com>
…in thread
ANR-dump analysis (logs/2026-05-12) attributes 25 of 500 ANRs (5.0%) to
`SessionService.onFocus` blocking the main thread. Stack signature:
java.util.concurrent.LinkedBlockingQueue.offer
java.util.concurrent.ThreadPoolExecutor.execute
kotlinx.coroutines.scheduling.CoroutineScheduler.dispatch
com.onesignal.common.threading.OneSignalDispatchers.<lazy chain>
... operation-repo / IAM trigger eval handlers ...
com.onesignal.session.internal.session.impl.b$c.invoke (sessionLifeCycleNotifier.fire)
com.onesignal.session.internal.session.impl.b.onFocus (SessionService.onFocus)
com.onesignal.core.internal.application.impl.a.handleFocus (fanout on main)
SessionService.onFocus runs synchronously on the main thread via
ApplicationService.handleFocus -> applicationLifecycleNotifier.fire. Inside,
sessionLifeCycleNotifier.fire { onSessionStarted/Active } invokes the
registered session-lifecycle handlers (operation repo, IAM trigger eval,
etc.) synchronously; the first one to touch OneSignalDispatchers pays the
cold-init cost on the main thread (5–20s blocks before the watchdog fires).
This is the same SDK-4505 / SDK-4506 / SDK-4507 fanout pattern -- but
on a pre-existing handler that predates background threading.
Fix: wrap onFocus / onUnfocused bodies in
runOnSerialIOIfBackgroundThreading (same gated helper SDK-4505 / SDK-4506
use). FF off retains the original inline semantics for the rollout control
cohort and the existing synchronous-onFocus tests; FF on routes through
the dedicated serial IO dispatcher so the fanout no longer runs on main.
Timing semantics:
* onFocus captures _time.currentTimeMillis on the caller's thread
BEFORE the wrapper and passes it into the deferred handleOnFocus
block, so session.startTime / session.focusTime still reflect when
Android delivered the event, not when the serial dispatcher ran.
* onUnfocused does the same so activeDuration accounting stays
accurate (dt = unfocusTimeMs - session.focusTime, both captured
on the main thread).
Tests:
* SessionServiceTests existing assertions about state mutation continue
to pass (the FF defaults to off, so the wrapper runs inline; new
private handleOnFocus / handleOnUnfocused preserve behavior).
* Adds three new assertions:
- onFocus dispatches through runOnSerialIOIfBackgroundThreading
- onUnfocused dispatches through runOnSerialIOIfBackgroundThreading
- rapid onUnfocused -> onFocus burst routes both events through the
same gated helper in submission order (mirrors SDK-4505
BackgroundManager burst coverage; protects against activeDuration
drift if the events ever raced across an unbounded IO pool).
Linear: SDK-4508
Co-authored-by: Cursor <cursoragent@cursor.com>
c91e0ab to
0193fee
Compare
📊 Diff Coverage ReportDiff Coverage Report (Changed Lines Only)Gate: aggregate coverage on changed executable lines must be ≥ 80% (JaCoCo line data for lines touched in the diff). Changed Files Coverage
Overall (aggregate gate)35/39 touched executable lines covered (89.7% — requires ≥ 80%) Per-file detail (informational; gate is aggregate above):
|
| Logging.debug("SessionService: New session started at ${session.startTime}") | ||
| sessionLifeCycleNotifier.fire { it.onSessionStarted() } | ||
| } else { | ||
| // existing session: just remember the focus time so we can calculate the active time | ||
| // when onUnfocused is called. | ||
| session.focusTime = _time.currentTimeMillis | ||
| session.focusTime = focusTimeMs | ||
| sessionLifeCycleNotifier.fire { it.onSessionActive() } | ||
| } | ||
| } | ||
|
|
||
| override fun onUnfocused() { | ||
| // Same pattern as onFocus: capture the timing on the caller's thread so activeDuration | ||
| // accounting is unaffected by serial-dispatcher latency, then offload. | ||
| val unfocusTimeMs = _time.currentTimeMillis | ||
| runOnSerialIOIfBackgroundThreading { | ||
| handleOnUnfocused(unfocusTimeMs) | ||
| } | ||
| } | ||
|
|
||
| private fun handleOnUnfocused(unfocusTimeMs: Long) { | ||
| val session = this.session | ||
| if (session == null) { | ||
| Logging.warn("SessionService.onUnfocused called before bootstrap; ignoring.") | ||
| return | ||
| } | ||
| // capture the amount of time the app was focused |
There was a problem hiding this comment.
🔴 After offloading sessionLifeCycleNotifier.fire { onSessionStarted/onSessionActive } to serial IO, InfluenceManager.onSessionStarted (InfluenceManager.kt:57) and onSessionActive (line 61) re-read _applicationService.entryState at execution time rather than capture time. If the main thread mutates entryState (via handleFocus/handleLostFocus) between dispatch and serial-IO execution, notification influence attribution can be misclassified (e.g., a NOTIFICATION_CLICK session reset to UNATTRIBUTED under the APP_OPEN branch of getChannelsToResetByEntryAction). Fix: capture entryState alongside focusTimeMs on the caller thread and pass it through to handleOnFocus/handleOnUnfocused (or thread it via an explicit parameter on ISessionLifecycleHandler).
Extended reasoning...
What the bug is
The PR carefully captures focusTimeMs = _time.currentTimeMillis on the caller's thread before offloading the body of SessionService.onFocus/onUnfocused to runOnSerialIOIfBackgroundThreading, so session.startTime / session.focusTime reflect the moment Android delivered the lifecycle event. But _applicationService.entryState, which is just as time-sensitive, is not captured. It is read at execution time by subscribers inside the deferred fire:
InfluenceManager.onSessionStarted()→restartSessionTrackersIfNeeded(_applicationService.entryState)(InfluenceManager.kt:56-58)InfluenceManager.onSessionActive()→attemptSessionUpgrade(_applicationService.entryState)(InfluenceManager.kt:60-62)
How it manifests — step-by-step proof
- T0 (main): User clicks notification.
NotificationLifecycleService.notificationOpenedruns on main:- Sets
_applicationService.entryState = AppEntryAction.NOTIFICATION_CLICK(NotificationLifecycleService.kt:177) - Calls
_influenceManager.onDirectInfluenceFromNotification(notificationId)→attemptSessionUpgrade(NOTIFICATION_CLICK, notificationId)→ notification tracker becomes DIRECT.
- Sets
- T0+ε (main): Android fires Activity lifecycle.
ApplicationService.handleFocusruns on main, theif (entryState != NOTIFICATION_CLICK)guard at line 391 preservesentryState = NOTIFICATION_CLICK, thenapplicationLifecycleNotifier.fire { onFocus(false) }. - T0+ε' (main):
SessionService.onFocus(false)capturesfocusTimeMsand submitshandleOnFocus(false, focusTimeMs)to the serial IO executor. - T0+δ (main): Before serial IO runs the submitted task, the user briefly backgrounds the activity.
ApplicationService.handleLostFocuson main setsentryState = APP_CLOSE(line 375). User refocuses;handleFocusseesentryState == APP_CLOSE != NOTIFICATION_CLICKand overwrites it withAPP_OPEN(line 392). - T0+Δ (serial IO): Step 3's submitted block executes.
session.isValid == false(cold start), so the!session.isValidbranch firessessionLifeCycleNotifier.fire { it.onSessionStarted() }. InfluenceManager.onSessionStartedreads_applicationService.entryState→ nowAPP_OPEN, not theNOTIFICATION_CLICKthat triggered step 3.restartSessionTrackersIfNeeded(APP_OPEN)→getChannelsToResetByEntryAction(APP_OPEN)takes a different branch than NOTIFICATION_CLICK would (InfluenceManager.kt:71-84):- APP_OPEN: adds
notificationChannelTracker(becauseentryAction.isAppOpen) andiAMChannelTracker→ both get reset. - NOTIFICATION_CLICK:
isAppOpenis false, so onlyiAMChannelTrackeris added — the notification tracker is deliberately left alone to preserve direct attribution.
- APP_OPEN: adds
- Because the wrong branch ran, the notification tracker's DIRECT influence (set in step 1) is overwritten via
setSessionTracker(DIRECT → INDIRECT or UNATTRIBUTED, depending onlastReceivedIds).willChangeSessionTrackerreturns true because the new influenceType differs from the existing DIRECT, so the change is committed andcacheState()persists the misclassification.
Why pre-PR didn't have this race
Pre-PR, sessionLifeCycleNotifier.fire ran synchronously on the main thread inside SessionService.onFocus. The block executed on the same thread that just set entryState (handleFocus runs the lifecycle fanout synchronously after setting entryState), and because it occupied the main thread, no subsequent main-thread Activity lifecycle callbacks could fire and mutate entryState during execution. The very main-thread block that this PR is fixing (the 5-20s cold-init stall) had the side effect of protecting these reads.
Why the refutation's mitigations don't hold
The refutation argues that direct attribution is set out-of-band via onDirectInfluenceFromNotification, so even if the deferred onSessionStarted reads the wrong entryState, attribution survives. This is wrong: restartSessionTrackersIfNeeded(APP_OPEN) explicitly resets the notification tracker (lines 76-78 add it because entryAction.isAppOpen), and willChangeSessionTracker will return true when transitioning DIRECT → INDIRECT/UNATTRIBUTED, so the DIRECT attribution set by the direct call path is actively overwritten. NOTIFICATION_CLICK protects the tracker by leaving notificationChannel = null (it is not in the reset list at all).
The refutation also calls the race "microseconds-narrow," but that contradicts the PR's own motivation: the entire reason for offloading is that the cold-init dispatcher chain takes 5-20s on the main thread. Even after prewarm() is added, the race window is at minimum the serial-IO queue depth plus dispatcher overhead; under load this can be hundreds of ms, more than enough for a user-paced focus/unfocus sequence (e.g. notification-shade dismiss, biometric prompt return, etc.) to mutate entryState.
How to fix
Capture entryState on the caller thread alongside focusTimeMs and thread it through to the deferred handler:
override fun onFocus(firedOnSubscribe: Boolean) {
val focusTimeMs = _time.currentTimeMillis
val entryStateSnapshot = _applicationService.entryState
runOnSerialIOIfBackgroundThreading {
handleOnFocus(firedOnSubscribe, focusTimeMs, entryStateSnapshot)
}
}Then either (a) pass entryStateSnapshot as a parameter on ISessionLifecycleHandler.onSessionStarted/onSessionActive (cleaner; explicit), or (b) snapshot it into a thread-local-equivalent state SessionService owns and have InfluenceManager read it via a getter parameter. Option (a) parallels how focusTimeMs is already threaded through and makes the timing invariant explicit at the contract level.
| } | ||
| } | ||
|
|
||
| private fun handleOnFocus( | ||
| firedOnSubscribe: Boolean, | ||
| focusTimeMs: Long, | ||
| ) { | ||
| Logging.log(LogLevel.DEBUG, "SessionService.onFocus() - fired from start: $firedOnSubscribe") | ||
|
|
There was a problem hiding this comment.
🔴 With SDK_BACKGROUND_THREADING on, SessionService.onFocus now defers handleOnFocus to the serial IO thread while StartupService continues iterating IStartableService.start() on the Default thread — and several of those start() methods (SessionListener, InfluenceManager, SubscriptionManager, InAppMessagesManager, UserRefreshService) call sessionService.subscribe(this). This introduces a cross-thread race where a subscriber added before fire snapshots subscribers can still observe shouldFireOnSubscribe = true afterwards, causing onSessionStarted() to fire twice — duplicate TrackSessionStartOperation enqueues, duplicate UserRefreshService.refreshUser() network calls, duplicate influence/IAM trigger evaluations. Pre-PR this path was deterministic because both handleOnFocus and the subsequent subscribe calls ran on the same startables thread.
Extended reasoning...
Threading topology (post-PR, FF on).
StartupService.scheduleStart() iterates IStartableService.start() sequentially on the Default dispatcher (StartupService.kt:28-37). SessionModule.kt:39-44 registers SessionService before SessionListener, so the loop reaches SessionService.start() first, which calls _applicationService.addApplicationLifecycleHandler(this). ApplicationService.addApplicationLifecycleHandler synchronously invokes handler.onFocus(true) when current != null (ApplicationService.kt:121-128).
Pre-PR, onFocus(true) ran handleOnFocus inline on the Default thread, so the entire body — shouldFireOnSubscribe = true, then sessionLifeCycleNotifier.fire { onSessionStarted() } — completed before SessionService.start() returned. The next startable's subscribe() always observed shouldFireOnSubscribe = true AFTER the fire (with no subscribers yet), giving exactly one onSessionStarted() per handler.
Post-PR, onFocus captures focusTimeMs then does runOnSerialIOIfBackgroundThreading { handleOnFocus(...) } and returns. The Default thread continues to SessionListener.start() → _sessionService.subscribe(this) (and so on for SubscriptionManager, InfluenceManager, UserRefreshService, InAppMessagesManager).
The double-fire interleaving.
SessionService.subscribe(handler):
A1. sessionLifeCycleNotifier.subscribe(handler) → synchronized(subscribers) { add(handler) }
A2. if (shouldFireOnSubscribe) handler.onSessionStarted()
handleOnFocus (in the !session.isValid branch, which is the cold-start path because endSession() invalidates the prior session):
B1. shouldFireOnSubscribe = firedOnSubscribe (true)
B2. sessionLifeCycleNotifier.fire { ... } → synchronized(subscribers) { subscribers.toList() } then for (s in localList) callback(s)
Walk through interleaving A1 → B1 → B2 → A2:
- Default thread acquires
subscribersmonitor, adds the handler, releases. A1's monitor exit synchronizes-with any subsequent acquire of the same monitor. - SerialIO thread sets
shouldFireOnSubscribe = true. - SerialIO thread acquires
subscribersmonitor (after A1's release, so A1 HB B2). Snapshot includes the handler. Releases monitor. Iterates →handler.onSessionStarted()fires (FIRE gradle? #1). - Default thread reads
shouldFireOnSubscribe. With B1 HB B2 (program order) and A1 HB B2 (monitor sync), the read at A2 happens after A1. On real Android ART with cache-coherent ARM cores and the intervening monitor ops, the read commonly observes the newtruevalue →handler.onSessionStarted()fires again (FIRE Added possibility to send status 'opened' for a message back to onesignal backend from client source code #2).
Observable impact. SessionListener.onSessionStarted() (SessionListener.kt:45-55) enqueues a TrackSessionStartOperation; double-fire produces duplicate session-start operations against the user, observable as inflated session counts in analytics. UserRefreshService.onSessionStarted() = refreshUser() (a network call) double-fires. InfluenceManager and InAppMessagesManager re-run session-start side effects. The window is small (the time between the SerialIO worker dequeuing the task and the Default thread's next instruction after subscribe() returns), but real and gated on the exact FF this PR targets.
Why the refutation's "narrow theoretical window" argument doesn't hold. The refutation argues that on cold start the SerialIO dispatcher is slow to spin up, so by the time handleOnFocus runs, all subscribe() calls have completed (no double fire). But the same PR series adds OneSignalDispatchers.prewarm() (OneSignalDispatchers.kt:186-238), called from initWithContext before bootstrapServices() and scheduleStart(). prewarm submits empty coroutines to all three dispatchers — including SerialIO — so the SerialIO worker thread is alive and idle by the time scheduleStart() reaches SessionService.start(). The "cold dispatcher absorbs the race window" assumption is invalidated by prewarm.
Re: missed-fire (Race #2). Acknowledging the refutation's and the third verifier's correct point: the alternative interleaving where subscribe.add happens after fire's snapshot establishes a monitor synchronizes-with edge from fire's monitor exit to subscribe's monitor acquire, which gives B1 HB A2 transitively. So A2 always sees the new true value, and there is exactly one fire (via subscribe). A missed fire is not possible. The bug here is strictly the double-fire scenario; the synthesis description's mention of "missed fire" should be disregarded.
Fix. Restore atomicity between the subscriber-registration check and the fire by either: (a) inside subscribe, take the same monitor used by fire while checking shouldFireOnSubscribe, plus mark shouldFireOnSubscribe @Volatile and read it AFTER releasing the subscribers monitor only if the snapshot was taken before B2; or simpler — (b) run the body inline on the caller thread when firedOnSubscribe == true (the synchronous addApplicationLifecycleHandler path), and only defer the natural lifecycle-event onFocus to serial IO; or (c) have fire and subscribe use the same lock so that "subscribed before fire snapshotted" deterministically implies "did not see shouldFireOnSubscribe = true".
| override fun onStartJob(jobParameters: JobParameters): Boolean { | ||
| // Android delivers JobService.onStartJob on the main thread. The suspendifyOnIO call | ||
| // below is the SDK's first IO-pool consumer on cold start in this process, and the | ||
| // executor + dispatcher + coroutine-scope lazy chain it triggers was producing multi- | ||
| // second main-thread blocks in production (SDK-4507). prewarm() shifts that cost to a | ||
| // short-lived daemon thread; idempotent, so it's harmless when initWithContext already | ||
| // ran. Must happen BEFORE suspendifyOnIO; calling it inside the dispatched block would | ||
| // be too late because the cold-init cost has already been paid on entry to the helper. | ||
| OneSignalDispatchers.prewarm() | ||
|
|
There was a problem hiding this comment.
🟡 The OneSignalDispatchers.prewarm() call in SyncJobService.onStartJob is unlikely to deliver the protection the new comment claims (lines 38-46): because nothing sits between prewarm() and suspendifyOnIO, and Kotlin's by lazy defaults to LazyThreadSafetyMode.SYNCHRONIZED, the main thread typically wins the race to the lazy chain (and pays the init cost) or — if the prewarm daemon wins — blocks on the synchronized monitor for the full init duration. Android's ANR detector treats a monitor-blocked main thread the same as one running slow code, so the cold-start SDK-4507 block this call site specifically targets is not actually prevented here. Consider either inlining the prewarm body into the start of the suspendifyOnIO block (so it always runs off-main), removing the call (the rest of the PR is the real fix), or at minimum updating the comment to reflect the probabilistic semantics.
Extended reasoning...
What the bug is
SyncJobService.onStartJob is delivered on the main thread. The new code is:
override fun onStartJob(jobParameters: JobParameters): Boolean {
OneSignalDispatchers.prewarm() // 1
suspendifyOnIO { ... } // 2
return true
}The accompanying comment asserts: "prewarm() shifts that cost to a short-lived daemon thread." But in this specific call site there is zero synchronous work between (1) and (2), which is exactly what the prewarm optimization needs to be effective. Contrast with OneSignalImp.initWithContext, where prewarm() is followed by synchronized(initLock) { ... }, ensureApplicationServiceStarted(), etc. — substantial wall-clock work that gives the daemon time to win the race.
Why the optimization is ineffective here
All the relevant properties in OneSignalDispatchers (IO, Default, SerialIO, IOScope, DefaultScope, SerialIOScope, ioExecutor, defaultExecutor, serialIOExecutor) are declared by lazy { ... } with no explicit thread-safety mode, which means Kotlin's default LazyThreadSafetyMode.SYNCHRONIZED applies: whichever thread enters the initializer first owns a synchronized monitor; every other thread blocks on that monitor until init completes.
The race timeline on a cold process (the documented SDK-4507 scenario where JobScheduler resurrects a process without initWithContext having run):
- Main:
prewarm()→Thread.start()returns immediately. The daemon is in NEW/RUNNABLE, awaiting OS scheduling. - Main:
suspendifyOnIO { ... }→suspendifyWithCompletion→if (ThreadingMode.useBackgroundThreading) OneSignalDispatchers.launchOnIO { ... }→ accessesIOScope(lazy) →IO(lazy) →ioExecutor(lazy).
The wall-clock gap between Thread.start() returning and the main thread reaching the lazy chain is on the order of nanoseconds (a few method dispatches). The daemon thread needs to be scheduled by the OS and then execute its own launchOnIO to reach the same lazy chain.
Further compounding the race: the prewarm thread is created with priority = Thread.NORM_PRIORITY - 2 (see OneSignalDispatchers.kt:236), which the OS scheduler treats as a hint to favor the main thread.
The "either outcome is bad" key observation
Regardless of who wins the race, the main thread blocks for ~init_time:
- Main wins → main runs the slow initializer inline.
- Daemon wins → main reaches the lazy access a few microseconds later and blocks on the synchronized monitor until the daemon finishes the 5–20 s init. Android's ANR detector treats a
WAITING/monitor-blocked main thread the same as one running slow code.
Step-by-step proof on a cold process
- Process starts because
JobScheduleris firingSyncJobService.OneSignalImp.initWithContexthas not run;prewarmStarted == false. - Main enters
onStartJob. Callsprewarm(): it setsprewarmStarted = true, constructsThread("OneSignal-prewarm", priority=NORM_PRIORITY-2), calls.start(), returns. Daemon is RUNNABLE but not yet on a CPU. - Main proceeds to
suspendifyOnIO { ... }. Within a few ns: enterssuspendifyWithCompletion, checksThreadingMode.useBackgroundThreading(the SDK-4507 cohort = true), callsOneSignalDispatchers.launchOnIO. That function body isreturn IOScope.launch { block() }— touchingIOScopetriggersby lazy { CoroutineScope(SupervisorJob() + IO) }, which in turn touchesIO, which in turn touchesioExecutor. The first of these synchronized monitors that the daemon hasn't already entered, main enters. - Case A — main wins the monitor: main constructs
ioExecutor(ThreadPoolExecutor, allocating its internalLinkedBlockingQueue, spawning core threads, etc.), wraps it inasCoroutineDispatcher(), constructsIOScope, then calls.launch { block() }which hitsThreadPoolExecutor.execute→LinkedBlockingQueue.offer— exactly the stack the PR description cites. Main is blocked for the full init duration. - Case B — daemon wins the monitor (less common because of priority + the head-start gap): daemon runs the same construction. Main reaches the lazy access ~μs later, finds the monitor held, blocks in
WAITINGuntil daemon releases. Same wall-clock cost on main.
What about the refutation?
The refutation argues this is not actionable because (a) the common case is idempotent (already-initialized SDK → prewarmStarted == true → no-op, lazy chain already warm), (b) the code is not worse than pre-PR, and (c) the bug is really a critique of the comment. Those points are fair:
- The code is correct: no crash, no incorrect behavior.
- It does not regress anything: pre-PR, the main thread blocked 100% of the time on cold start; post-PR, it still blocks (deterministically when main wins, monitor-blocked when daemon wins).
But the reason this is still worth surfacing rather than abstaining: the PR description and the new comment at SyncJobService.kt:38-46 both frame this prewarm() call as a real piece of the SDK-4507 fix, and the JobService cold-start scenario is precisely the one cited in the description ("Android cold-starts a process just to run SyncJobService"). A future maintainer reading the comment will reasonably believe the SyncJobService path is covered, when in practice the cold-process block remains. The real protection in this PR comes from runOnSerialIOIfBackgroundThreading on the lifecycle handlers, not from this prewarm() call.
How to fix
Options (any one is sufficient):
- Inline the warming into the dispatched block: move all of
onStartJob's body inside thesuspendifyOnIO(already does that today) — but skipprewarm()entirely here, since by the timesuspendifyOnIOruns the cold-init cost has been paid on the IO worker thread anyway. The currentprewarm()is dead weight in the cold-process case. - Use a non-ANR-sensitive thread for the JobService entry: spawn a worker thread (or use
runOnSerialIOIfBackgroundThreading) for the entireonStartJobbody, so the cold-init cost — whoever pays it — never lands on main. - At minimum, update the comment to acknowledge the race semantics: prewarm is only effective when meaningful synchronous work follows it on the caller's thread, which is not the case here. This keeps a future maintainer from assuming SyncJobService is covered when it isn't.
The third option is the lightest touch and removes the misleading load-bearing claim without changing behavior.
520ae2c to
0621414
Compare
|
Closing — the SessionService.onFocus / onUnfocused offload from this PR has been consolidated into #2644 (the unified onFocus / onUnfocused offload PR). All five lifecycle handlers (BackgroundManager, NotificationsManager, NotificationPermissionController, FeatureFlagsRefreshService, SessionService) now move together on the same FF in #2644. |
Summary
ANR-dump analysis of
logs/2026-05-12 (500 ANR entries, all on sdk_background_threading)attributes 25 of 500 ANRs (5.0%) toSessionService.onFocusblocking the main thread. Stack signature:SessionService.onFocusruns synchronously on the main thread viaApplicationService.handleFocus -> applicationLifecycleNotifier.fire. Inside,sessionLifeCycleNotifier.fire { onSessionStarted / onSessionActive }invokes the registered session-lifecycle handlers (operation repo, IAM trigger eval, etc.) synchronously; the first one to touchOneSignalDispatcherspays the cold-init cost on the main thread (5–20s blocks before the watchdog fires).Same SDK-4505 / SDK-4506 / SDK-4507 fanout pattern — but on a pre-existing handler that predates background threading.
Linear: SDK-4508
Fix
Wrap
onFocus/onUnfocusedbodies inrunOnSerialIOIfBackgroundThreading(the same gated helper SDK-4505, SDK-4506, and SDK-4507 use). FF off retains the original inline semantics for the rollout control cohort and existing synchronous tests; FF on routes through the dedicated serial IO dispatcher so the fanout no longer runs on main.Timing semantics preserved
The body of
onFocuswritessession.startTime = _time.currentTimeMillisandsession.focusTime = _time.currentTimeMillisto record when Android delivered the lifecycle event. If we simply deferred everything to the serial dispatcher, those timestamps would shift to whenever the dispatcher got around to the block — which can be hundreds of ms later under load.To keep the existing semantics:
onFocuscaptures_time.currentTimeMillison the caller's thread BEFORE the wrapper, passes it into the deferredhandleOnFocus(firedOnSubscribe, focusTimeMs)block.onUnfocuseddoes the same soactiveDurationaccounting stays accurate (dt = unfocusTimeMs - session.focusTime, both captured on the main thread).Base branch
Stacked on
ar/sdk-4506-notifications-onfocus-anr(#2644) because the fix relies onrunOnSerialIOIfBackgroundThreading, which was introduced in that PR. Rebases cleanly ontomainafter #2643 and #2644 land.Test plan
SessionServiceTestsexisting assertions about state mutation continue to pass (the FF defaults to off, so the wrapper runs inline; the new privatehandleOnFocus/handleOnUnfocusedpreserve behavior).onFocusdispatches the session-mutation body throughrunOnSerialIOIfBackgroundThreading.onUnfocuseddispatches theactiveDurationupdate throughrunOnSerialIOIfBackgroundThreading.onUnfocused -> onFocusburst routes both events through the same gated helper in submission order (mirrors SDK-4505BackgroundManagerburst coverage; protects againstactiveDurationdrift if the events ever raced across an unbounded IO pool).:core detekt— no new findings.Made with Cursor