Skip to content

activation_checkpointing: default num_layers to None so configure() assert fires#8041

Open
Kymi808 wants to merge 3 commits into
deepspeedai:masterfrom
Kymi808:fix/activation-checkpointing-num-layers-default
Open

activation_checkpointing: default num_layers to None so configure() assert fires#8041
Kymi808 wants to merge 3 commits into
deepspeedai:masterfrom
Kymi808:fix/activation-checkpointing-num-layers-default

Conversation

@Kymi808
Copy link
Copy Markdown

@Kymi808 Kymi808 commented Jun 1, 2026

Summary

Calling deepspeed.checkpointing.configure(contiguous_checkpointing=True, partition_activations=True) without setting num_checkpoints (or otherwise providing a layer count) is supposed to fail fast with the assert at deepspeed/runtime/activation_checkpointing/checkpointing.py:1108:

if CONTIGUOUS_CHECKPOINTING:
    assert num_layers is not None, "Must specify the number of layers with contiguous memory checkpointing"

That assert never fires because _configure_defaults() (called inside configure() at line 1079) initialized:

num_layers = False

False is not None is True, so the assert silently passes. The user instead hits a much later cryptic IndexError from range(num_layers) (lines 399 / 406) or a 0-element allocation from numel * num_layers (lines 457 / 461).

The module-level default at line 46 is already num_layers = None, and every other path that sets it (set_num_layers, config.number_checkpoints) assigns an integer — only _configure_defaults used False, which looks like a copy-paste from the surrounding PARTITION_ACTIVATIONS = False etc. block.

Fix

-    num_layers = False
+    num_layers = None

One-character change. No callers compare num_layers to False (downstream uses are range(num_layers) and numel * num_layers, both requiring an int), so the only path this changes is the broken one: users now get the documented "Must specify the number of layers" assert instead of a downstream IndexError.

Test

Adds test_configure_with_contiguous_checkpointing_requires_num_checkpoints to the existing tests/unit/runtime/activation_checkpointing/test_activation_checkpointing.py (consolidating per AGENTS.md, not a new file). It calls configure(contiguous_checkpointing=True, partition_activations=True) and asserts the expected AssertionError matches "number of layers". On main the assert silently passes and the test fails; with this change it passes.

CI/lint

  • pre-commit run --files <changed files>: all hooks pass (yapf, flake8, check-torchdist, check-license, codespell, check-torchcuda).
  • DCO Signed-off-by on the commit.
  • No competing PR (gh pr list --search "num_layers checkpointing OR _configure_defaults OR contiguous_checkpointing assert").

…() assert fires

The module-level default for ``num_layers`` is ``None`` (line 46), and
``configure()`` asserts ``num_layers is not None`` with the message "Must
specify the number of layers with contiguous memory checkpointing" when
``CONTIGUOUS_CHECKPOINTING`` is enabled (line 1108).

``_configure_defaults()`` was initializing ``num_layers = False`` instead.
``False is not None`` evaluates to True, so the assert silently passed when
``configure(contiguous_checkpointing=True)`` was called without
``num_checkpoints``. The user then hit a cryptic ``IndexError`` deep inside
``range(num_layers)`` / ``numel * num_layers`` rather than the intended
helpful assertion message.

Set the default to ``None`` so the assert behaves as documented. Adds a
regression test in tests/unit/runtime/activation_checkpointing/.

Signed-off-by: Kymi808 <zeng.kyle13@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3793254207

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +322 to +325
deepspeed.checkpointing.configure(
mpu_=None,
partition_activations=True,
contiguous_checkpointing=True,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore checkpointing state after assert test

When this test runs before any other activation-checkpointing test in the same pytest worker, this configure() call mutates module globals before raising, leaving PARTITION_ACTIVATIONS and CONTIGUOUS_CHECKPOINTING true while num_layers is None. Subsequent ckpt/non_reentrant_checkpoint calls then enter partition_activations(..., contiguous_checkpoint=True) and reach range(num_layers), causing order-dependent failures; please restore the previous checkpointing globals or reset defaults in a finally block after asserting the exception.

Useful? React with 👍 / 👎.

Codex review (deepspeedai#8041) noted that `configure()` mutates module globals
(`PARTITION_ACTIVATIONS`, `CONTIGUOUS_CHECKPOINTING`, etc.) before the
assertion fires, leaving them set when control unwinds. Subsequent
activation-checkpointing tests sharing the same pytest worker then enter
`partition_activations(..., contiguous_checkpoint=True)` and hit
`range(num_layers)` with `num_layers=None`, causing order-dependent
failures — exactly what surfaced as the `modal-torch-latest` CI failure.

Snapshot and restore the relevant module globals around the call so the
test cleans up after itself.

Signed-off-by: Kymi808 <zeng.kyle13@gmail.com>
@Kymi808
Copy link
Copy Markdown
Author

Kymi808 commented Jun 1, 2026

Thanks Codex — addressed in 41f3fd1. The test now snapshots and restores PARTITION_ACTIVATIONS, CONTIGUOUS_CHECKPOINTING, num_layers, CPU_CHECKPOINT, SYNCHRONIZE, PROFILE_TIME, mpu, and deepspeed_checkpointing_enabled around the configure() call via a try/finally, so the order-dependent failure in modal-torch-latest shouldn't recur.

@tohtana
Copy link
Copy Markdown
Collaborator

tohtana commented Jun 2, 2026

Thank you @Kymi808 for your contribution.
This looks good to me. Can you fix the formatting error?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants