fix: Fix Snapshotter handling of out of order samples #1735
… expected source of flaky behavior
Codecov Report ✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff             @@
##           master    #1735      +/-   ##
==========================================
- Coverage   92.41%   92.37%   -0.05%
==========================================
  Files         156      156
  Lines       10628    10624       -4
==========================================
- Hits         9822     9814       -8
- Misses        806      810       +4
```
vdusek left a comment:
```
FAILED tests/unit/_autoscaling/test_snapshotter.py::test_snapshot_pruning_removes_outdated_records - assert 3 == 2
 -  where 3 = len([CpuSnapshot(used_ratio=0.5, max_used_ratio=0.95, created_at=datetime.datetime(2026, 2, 11, 7, 39, 32, 789908, tzinfo=datetime.timezone.utc)), CpuSnapshot(used_ratio=0.5, max_used_ratio=0.95, created_at=datetime.datetime(2026, 2, 11, 9, 39, 32, 789908, tzinfo=datetime.timezone.utc)), CpuSnapshot(used_ratio=0.5, max_used_ratio=0.95, created_at=datetime.datetime(2026, 2, 11, 10, 39, 32, 789908, tzinfo=datetime.timezone.utc))])
============ 1 failed, 1615 passed, 8 skipped in 345.52s (0:05:45) =============
Error: Process completed with exit code 1.
```
Force-pushed from d7783c5 to 63e8468.
@Mantisus, @vdusek, please check the root cause in the issue description. The suggested fix should be ok, but I fear it is too implicit and relies on …
The root cause hypothesis seems incorrect. If the cause was an error processing the 2-hour snapshot, then there would be 1 snapshot in the failed tests, not 3. It might be worth trying:

```python
for event_data in events_data:
    event_manager.emit(event=Event.SYSTEM_INFO, event_data=event_data)
    await event_manager.wait_for_all_listeners_to_complete()
```

Perhaps the reason is that …
Yes, the hypothesis in the issue description seems incorrect. I was referring to the root cause description in this PR.
I'd leave it up to you. If you believe the current approach is stable, I'm okay with it. Otherwise, let's implement the thread locks.
Got it. Perhaps the race condition also affects this. I hypothesize that when …
Different insertion order is not a problem, as it should be handled by using …
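The sorted-insertion behavior alluded to here can be sketched with `bisect.insort` keyed on the snapshot timestamp. This is a minimal stand-in, not the library's actual snapshot types:

```python
import bisect
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(order=True)
class Snapshot:
    """Minimal stand-in for the real CpuSnapshot; compared by created_at only."""
    created_at: datetime
    used_ratio: float = field(compare=False, default=0.5)


now = datetime.now(timezone.utc)
snapshots: list[Snapshot] = []

# Samples arrive out of order (5h old, then current, then 2h old) ...
for hours_ago in [5, 0, 2]:
    # ... but sorted insertion keeps the list ordered by timestamp.
    bisect.insort(snapshots, Snapshot(created_at=now - timedelta(hours=hours_ago)))

ages = [s.created_at for s in snapshots]
assert ages == sorted(ages)  # chronological regardless of arrival order
```

So out-of-order arrival alone cannot corrupt the list's ordering; the bug must lie elsewhere (in the pruning, or in concurrent mutation).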
The snapshot is added to …. Therefore, it is important which snapshot will be processed last. In the next test, 0 snapshots will be deleted. This is because the last snapshot processed is …

```python
async def test_without_pruning(
    snapshotter: Snapshotter, event_manager: LocalEventManager, default_memory_info: MemoryInfo
) -> None:
    snapshotter._SNAPSHOT_HISTORY = timedelta(hours=2)
    now = datetime.now(timezone.utc)
    snapshots = [
        CpuSnapshot(used_ratio=0.5, max_used_ratio=0.95, created_at=now - timedelta(hours=delta))
        for delta in [5, 2, 0]
    ]
    for snapshot in snapshots:  # create `SortedSnapshotList` with 3 elems
        snapshotter._cpu_snapshots.add(snapshot)

    event_manager.emit(
        event=Event.SYSTEM_INFO,
        event_data=EventSystemInfoData(
            cpu_info=CpuInfo(used_ratio=0.5, created_at=now - timedelta(hours=3)),
            memory_info=default_memory_info,
        ),
    )
    await event_manager.wait_for_all_listeners_to_complete()

    cpu_snapshots = cast('list[CpuSnapshot]', snapshotter.get_cpu_sample())
    assert len(cpu_snapshots) == 4  # Not one was removed
```
@Mantisus good point! To ensure that the …
Updated to handle the scenario mentioned by @Mantisus, when the last snapshot is out of order.
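The updated approach can be sketched as follows. This is an illustrative re-implementation, not the actual `Snapshotter` code: the `Snapshot` stand-in class, the `add_and_prune` helper, and the 2-hour `SNAPSHOT_HISTORY` window are assumptions for the sketch.

```python
import bisect
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SNAPSHOT_HISTORY = timedelta(hours=2)  # assumed history window, as in the tests


@dataclass(order=True)
class Snapshot:
    """Minimal stand-in for the real snapshot types; compared by created_at."""
    created_at: datetime


def add_and_prune(snapshots: list[Snapshot], new: Snapshot) -> None:
    # Add the new snapshot first, keeping the list sorted by created_at.
    bisect.insort(snapshots, new)
    # Prune relative to the newest timestamp *in the list*, which is not
    # necessarily the snapshot that was just added.
    cutoff = snapshots[-1].created_at - SNAPSHOT_HISTORY
    while snapshots and snapshots[0].created_at < cutoff:
        del snapshots[0]


now = datetime.now(timezone.utc)
history: list[Snapshot] = []
for hours_ago in [5, 2, 0, 3]:  # the last sample arrives out of order
    add_and_prune(history, Snapshot(created_at=now - timedelta(hours=hours_ago)))

# The stale 3h-old straggler is pruned instead of resetting the cutoff:
# only the 2h-old and current snapshots remain.
assert [now - s.created_at for s in history] == [timedelta(hours=2), timedelta(0)]
```

Pruning against the newest list entry rather than the last-processed snapshot is what makes an out-of-order final sample harmless: it can no longer move the cutoff backwards.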
Changed the title: test_snapshot_pruning_removes_outdated_records → Snapshotter handling of out of order new samples
Changed the title: Snapshotter handling of out of order new samples → Snapshotter handling of out of order samples
Description:
- Updated `test_snapshot_pruning_removes_outdated_records` to remove the expected source of flaky behavior.
- `event_manager.on(event=Event.SYSTEM_INFO, listener=self._snapshot_cpu)` adds the `self._snapshot_cpu` listener, but the event manager runs sync listeners through `asyncio.to_thread`. `self._snapshot_cpu` modifies an instance list in place (for example `self._cpu_snapshots`): it performs `bisect.insort` and `del` operations on the same list from several threads, which creates an opportunity for a race condition.
- The `Snapshot` is now added first, and the pruning of the snapshot list is done based on the `Snapshot` with the newest date in the list (the newest date does not have to be the last added `Snapshot`). This fixes the bug where the last out-of-order `Snapshot` could cause wrong pruning.

Issues:
Closes: #1734