Reject stateful programs whose output aliases a state-write source by john-rocky · Pull Request #2713 · apple/coremltools

john-rocky · 2026-05-22T04:16:39Z

Summary

The Core ML runtime proxy crashes with a segmentation fault (no Python traceback) when loading an mlprogram whose function output Var is the same Var that feeds a coreml_update_state op. It is easy to hit when porting a torch decoder: write to a KV cache with self.cache[:] = merged and then return merged.

This PR rejects the pattern at conversion time instead, raising a ValueError that names the offending output and the affected state and points at a workaround:

Function 'main' has a model output 'merged' that is also a source of the value written into state 'cache'. Loading this model in the Core ML runtime currently crashes with a segmentation fault, so the converter rejects it here instead. Workaround: return a tensor that does not feed the state-write chain, e.g. return value.sum(dim=-1, keepdim=True) or return value * other_tensor.
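For illustration, the pattern the message describes and the suggested workaround might look like this in torch. This is only a minimal sketch: the class names, the tensor shapes, and the way merged is computed are assumptions rather than code from this PR, and the conversion step itself is not shown.

```python
import torch
from torch import nn


class AliasingDecoder(nn.Module):
    """Hypothetical module: writes `merged` into the cache and returns the same tensor."""

    def __init__(self, seq_len: int = 4, dim: int = 8):
        super().__init__()
        # The buffer plays the role of the KV cache that would become Core ML state.
        self.register_buffer("cache", torch.zeros(1, seq_len, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        merged = self.cache + x
        self.cache[:] = merged  # state write
        return merged  # the output is the very tensor that was written


class NonAliasingDecoder(AliasingDecoder):
    """Hypothetical workaround variant: returns a reduction derived from `merged`."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        merged = self.cache + x
        self.cache[:] = merged  # same state write as above
        # The returned tensor is computed from `merged` but is not what gets written.
        return merged.sum(dim=-1, keepdim=True)
```

Both variants run in eager torch. Going by the description above, only the first would be rejected at conversion time once cache is exported as Core ML state.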

_validate_no_state_write_aliased_with_output runs in backend/mil/load.py and walks back (bounded depth) from each coreml_update_state value to check whether any ancestor Var is also a model output. It only fires on the exact aliasing pattern, so models whose returned tensor does not feed the state-write chain are unaffected.
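The backward walk described above can be sketched on a toy graph. Everything here is hypothetical: Var and StateWrite are stand-ins rather than the coremltools MIL classes, check_no_output_alias is not the PR's function, and the default depth bound of 8 is an assumption, since the PR text does not state the actual bound.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Var:
    """Toy stand-in for a MIL Var (hypothetical, not the coremltools class)."""
    name: str
    # Vars consumed by the op that produced this one; empty for function inputs.
    sources: List["Var"] = field(default_factory=list)


@dataclass(eq=False)
class StateWrite:
    """Toy stand-in for one coreml_update_state op."""
    state_name: str
    value: Var


def check_no_output_alias(func_name, state_writes, outputs, max_depth=8):
    """Raise ValueError if an output is the written value or one of its ancestors."""
    output_ids = {id(v) for v in outputs}
    for write in state_writes:
        # Breadth-first, so each Var is first reached at its smallest depth.
        queue = deque([(write.value, 0)])
        seen = set()
        while queue:
            var, depth = queue.popleft()
            if id(var) in seen:
                continue
            seen.add(id(var))
            if id(var) in output_ids:
                raise ValueError(
                    f"Function '{func_name}' has a model output '{var.name}' that is "
                    f"also a source of the value written into state '{write.state_name}'."
                )
            if depth < max_depth:  # bounded depth: stop expanding past the limit
                queue.extend((src, depth + 1) for src in var.sources)


# Usage on a three-node graph: x -> merged -> reduced, with merged written to 'cache'.
x = Var("x")
merged = Var("merged", [x])
reduced = Var("reduced", [merged])
write = StateWrite("cache", merged)

check_no_output_alias("main", [write], [reduced])  # passes: reduced is downstream only
try:
    check_no_output_alias("main", [write], [merged])
except ValueError as err:
    print(err)
```

The walk only goes upstream from the written value, which is why a tensor merely derived from merged (reduced here) does not trip the check.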

Test plan

New coremltools/test/ml_program/test_stateful_output_alias_guard.py:

test_aliasing_pattern_raises_clear_error — the aliasing forward raises the ValueError (asserts the message names the output, the state, and the workaround).
test_non_aliasing_pattern_converts — the reduced-output variant (return merged.sum(...)) still converts (guards against false positives).

Verified locally on macOS (the raise path runs entirely in the converter, before runtime load):

$ python -m pytest coremltools/test/ml_program/test_stateful_output_alias_guard.py::TestStatefulOutputAliasGuard::test_aliasing_pattern_raises_clear_error -q
1 passed, 3 warnings in 1.63s

The non-aliasing case exercises full mlprogram serialization and runtime load, so it runs in CI.

When a function output Var feeds the value side of a coreml_update_state op, the Core ML runtime proxy crashes with a hard segmentation fault on load: no Python traceback, just a process exit. The pattern is natural to write when porting torch decoder transformers (return the merged tensor that was just stored in the KV cache), so the silent crash is a footgun.

Add a backend-level validation that walks the source-Var graph backwards from each coreml_update_state value and raises a clear ValueError if it finds a model output along the way. The error names the offending output and the affected state, and points at the workaround.

Existing programs that already follow the recommended pattern (return a non-buffer-shaped derived tensor) keep converting unchanged.

john-rocky wants to merge 1 commit into apple:main from john-rocky:guard-stateful-output-alias-crash
