Reject stateful programs whose output aliases a state-write source #2713

Open
john-rocky wants to merge 1 commit into apple:main from john-rocky:guard-stateful-output-alias-crash

@john-rocky
Contributor

Summary

The Core ML runtime proxy crashes with a segmentation fault (no Python traceback) when loading an mlprogram whose function output Var is the same Var that feeds a coreml_update_state op. It is easy to hit when porting a torch decoder: write to a KV cache with self.cache[:] = merged and then return merged.

This PR rejects the pattern at conversion time instead, raising a ValueError that names the offending output and the affected state and points to a workaround:

Function 'main' has a model output 'merged' that is also a source of the value written into state 'cache'. Loading this model in the Core ML runtime currently crashes with a segmentation fault, so the converter rejects it here instead. Workaround: return a tensor that does not feed the state-write chain, e.g. return value.sum(dim=-1, keepdim=True) or return value * other_tensor.

_validate_no_state_write_aliased_with_output runs in backend/mil/load.py. For each coreml_update_state op it walks backwards from the written value, up to a bounded depth, and checks whether any Var along the way is also a model output. It fires only when a model output is the written value itself or one of its ancestors within that bound, so models whose returned tensor does not feed the state-write chain are unaffected.
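
The backward walk described above can be sketched on a toy graph. This is a minimal illustration, not the code in this PR: the `Var` and `Op` classes, the `find_aliased_output` and `validate_no_alias` helpers, the `max_depth` default, and the shortened error text are all hypothetical stand-ins, and the real MIL types in coremltools differ.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(eq=False)
class Op:
    """Toy op (hypothetical): a type name plus its input Vars."""
    op_type: str
    inputs: List["Var"] = field(default_factory=list)


@dataclass(eq=False)
class Var:
    """Toy Var (hypothetical): a name plus the op that produced it, if any."""
    name: str
    producer: Optional[Op] = None


def find_aliased_output(
    state_value: Var, outputs: Sequence[Var], max_depth: int = 8
) -> Optional[Var]:
    """Breadth-first walk back from a state-write value.

    Returns the first visited Var that is also a model output, or None.
    Vars are compared by identity, since aliasing means "the same Var".
    """
    output_ids = {id(v) for v in outputs}
    seen = set()
    queue = deque([(state_value, 0)])
    while queue:
        var, depth = queue.popleft()
        if id(var) in seen:
            continue
        seen.add(id(var))
        if id(var) in output_ids:
            return var
        # Stop at graph inputs and at the depth bound.
        if var.producer is None or depth >= max_depth:
            continue
        for parent in var.producer.inputs:
            queue.append((parent, depth + 1))
    return None


def validate_no_alias(
    func_name: str,
    state_writes: Sequence[Tuple[str, Var]],
    outputs: Sequence[Var],
) -> None:
    """Raise ValueError if any output feeds a state write (illustrative message)."""
    for state_name, value in state_writes:
        hit = find_aliased_output(value, outputs)
        if hit is not None:
            raise ValueError(
                f"Function {func_name!r} has a model output {hit.name!r} that is "
                f"also a source of the value written into state {state_name!r}."
            )


# Toy graph: merged = cache_read + x; reduced = reduce_sum(merged).
x = Var("x")
cache_read = Var("cache_read")
merged = Var("merged", Op("add", [cache_read, x]))
reduced = Var("reduced", Op("reduce_sum", [merged]))

# Returning a tensor derived from `merged` passes: `reduced` is downstream of
# the written value, not an ancestor of it.
validate_no_alias("main", [("cache", merged)], [reduced])

# Returning `merged` itself is the aliasing pattern and is rejected.
try:
    validate_no_alias("main", [("cache", merged)], [merged])
except ValueError as err:
    print(err)
```

In this sketch the walk only follows producer edges upward, which is why a reduced or otherwise derived output is not flagged while an output that sits on the path into the state write is.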

Test plan

New coremltools/test/ml_program/test_stateful_output_alias_guard.py:

  • test_aliasing_pattern_raises_clear_error — the aliasing forward raises the ValueError (asserts the message names the output, the state, and the workaround).
  • test_non_aliasing_pattern_converts — the reduced-output variant (return merged.sum(...)) still converts (guards against false positives).

Verified locally on macOS (the raise path runs entirely in the converter, before runtime load):

$ python -m pytest coremltools/test/ml_program/test_stateful_output_alias_guard.py::TestStatefulOutputAliasGuard::test_aliasing_pattern_raises_clear_error -q
1 passed, 3 warnings in 1.63s

The non-aliasing case exercises full mlprogram serialization and runtime load, so it runs in CI.

When a function output Var feeds the value side of a coreml_update_state
op, the Core ML runtime proxy crashes with a hard segmentation fault on
load — no Python traceback, just a process exit. The pattern is natural
to write when porting torch decoder transformers (return the merged
tensor that was just stored in the KV cache), so the silent crash is a
footgun.

Add a backend-level validation that walks the source-Var graph backwards
from each coreml_update_state value and raises a clear ValueError if it
finds a model output along the way. The error names the offending output
and the affected state, and points at the workaround.

Existing programs that already follow the recommended pattern (return a
non-buffer-shaped derived tensor) keep converting unchanged.