activation_checkpointing: default num_layers to None so configure() assert fires#8041
activation_checkpointing: default num_layers to None so configure() assert fires#8041Kymi808 wants to merge 3 commits into
Conversation
…() assert fires The module-level default for ``num_layers`` is ``None`` (line 46), and ``configure()`` asserts ``num_layers is not None`` with the message "Must specify the number of layers with contiguous memory checkpointing" when ``CONTIGUOUS_CHECKPOINTING`` is enabled (line 1108). ``_configure_defaults()`` was initializing ``num_layers = False`` instead. ``False is not None`` evaluates to True, so the assert silently passed when ``configure(contiguous_checkpointing=True)`` was called without ``num_checkpoints``. The user then hit a cryptic ``IndexError`` deep inside ``range(num_layers)`` / ``numel * num_layers`` rather than the intended helpful assertion message. Set the default to ``None`` so the assert behaves as documented. Adds a regression test in tests/unit/runtime/activation_checkpointing/. Signed-off-by: Kymi808 <zeng.kyle13@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3793254207
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| deepspeed.checkpointing.configure( | ||
| mpu_=None, | ||
| partition_activations=True, | ||
| contiguous_checkpointing=True, |
There was a problem hiding this comment.
Restore checkpointing state after assert test
When this test runs before any other activation-checkpointing test in the same pytest worker, this configure() call mutates module globals before raising, leaving PARTITION_ACTIVATIONS and CONTIGUOUS_CHECKPOINTING true while num_layers is None. Subsequent ckpt/non_reentrant_checkpoint calls then enter partition_activations(..., contiguous_checkpoint=True) and reach range(num_layers), causing order-dependent failures; please restore the previous checkpointing globals or reset defaults in a finally block after asserting the exception.
Useful? React with 👍 / 👎.
Codex review (deepspeedai#8041) noted that `configure()` mutates module globals (`PARTITION_ACTIVATIONS`, `CONTIGUOUS_CHECKPOINTING`, etc.) before the assertion fires, leaving them set when control unwinds. Subsequent activation-checkpointing tests sharing the same pytest worker then enter `partition_activations(..., contiguous_checkpoint=True)` and hit `range(num_layers)` with `num_layers=None`, causing order-dependent failures — exactly what surfaced as the `modal-torch-latest` CI failure. Snapshot and restore the relevant module globals around the call so the test cleans up after itself. Signed-off-by: Kymi808 <zeng.kyle13@gmail.com>
|
Thanks Codex — addressed in |
|
Thank you @Kymi808 for your contribution. |
Summary
Calling
deepspeed.checkpointing.configure(contiguous_checkpointing=True, partition_activations=True)without settingnum_checkpoints(or otherwise providing a layer count) is supposed to fail fast with the assert atdeepspeed/runtime/activation_checkpointing/checkpointing.py:1108:That assert never fires because
_configure_defaults()(called insideconfigure()at line 1079) initialized:False is not NoneisTrue, so the assert silently passes. The user instead hits a much later crypticIndexErrorfromrange(num_layers)(lines 399 / 406) or a 0-element allocation fromnumel * num_layers(lines 457 / 461).The module-level default at line 46 is already
num_layers = None, and every other path that sets it (set_num_layers,config.number_checkpoints) assigns an integer — only_configure_defaultsusedFalse, which looks like a copy-paste from the surroundingPARTITION_ACTIVATIONS = Falseetc. block.Fix
One-character change. No callers compare
num_layerstoFalse(downstream uses arerange(num_layers)andnumel * num_layers, both requiring an int), so the only path this changes is the broken one: users now get the documented "Must specify the number of layers" assert instead of a downstreamIndexError.Test
Adds
test_configure_with_contiguous_checkpointing_requires_num_checkpointsto the existingtests/unit/runtime/activation_checkpointing/test_activation_checkpointing.py(consolidating perAGENTS.md, not a new file). It callsconfigure(contiguous_checkpointing=True, partition_activations=True)and asserts the expectedAssertionErrormatches"number of layers". Onmainthe assert silently passes and the test fails; with this change it passes.CI/lint
pre-commit run --files <changed files>: all hooks pass (yapf, flake8,check-torchdist,check-license, codespell,check-torchcuda).Signed-off-byon the commit.gh pr list --search "num_layers checkpointing OR _configure_defaults OR contiguous_checkpointing assert").