activation_checkpointing: default num_layers to None so configure() assert fires by Kymi808 · Pull Request #8041 · deepspeedai/DeepSpeed

Kymi808 · 2026-06-01T23:18:14Z

Summary

Calling deepspeed.checkpointing.configure(contiguous_checkpointing=True, partition_activations=True) without setting num_checkpoints (or otherwise providing a layer count) is supposed to fail fast with the assert at deepspeed/runtime/activation_checkpointing/checkpointing.py:1108:

if CONTIGUOUS_CHECKPOINTING:
    assert num_layers is not None, "Must specify the number of layers with contiguous memory checkpointing"

That assert never fires because _configure_defaults() (called inside configure() at line 1079) initialized:

num_layers = False

False is not None is True, so the assert silently passes. The user instead hits a much later cryptic IndexError from range(num_layers) (lines 399 / 406) or a 0-element allocation from numel * num_layers (lines 457 / 461).

The module-level default at line 46 is already num_layers = None, and every other path that sets it (set_num_layers, config.number_checkpoints) assigns an integer — only _configure_defaults used False, which looks like a copy-paste from the surrounding PARTITION_ACTIVATIONS = False etc. block.

Fix

-    num_layers = False
+    num_layers = None

One-character change. No callers compare num_layers to False (downstream uses are range(num_layers) and numel * num_layers, both requiring an int), so the only path this changes is the broken one: users now get the documented "Must specify the number of layers" assert instead of a downstream IndexError.

Test

Adds test_configure_with_contiguous_checkpointing_requires_num_checkpoints to the existing tests/unit/runtime/activation_checkpointing/test_activation_checkpointing.py (consolidating per AGENTS.md, not a new file). It calls configure(contiguous_checkpointing=True, partition_activations=True) and asserts the expected AssertionError matches "number of layers". On main the assert silently passes and the test fails; with this change it passes.

CI/lint

pre-commit run --files <changed files>: all hooks pass (yapf, flake8, check-torchdist, check-license, codespell, check-torchcuda).
DCO Signed-off-by on the commit.
No competing PR (gh pr list --search "num_layers checkpointing OR _configure_defaults OR contiguous_checkpointing assert").

…() assert fires The module-level default for ``num_layers`` is ``None`` (line 46), and ``configure()`` asserts ``num_layers is not None`` with the message "Must specify the number of layers with contiguous memory checkpointing" when ``CONTIGUOUS_CHECKPOINTING`` is enabled (line 1108). ``_configure_defaults()`` was initializing ``num_layers = False`` instead. ``False is not None`` evaluates to True, so the assert silently passed when ``configure(contiguous_checkpointing=True)`` was called without ``num_checkpoints``. The user then hit a cryptic ``IndexError`` deep inside ``range(num_layers)`` / ``numel * num_layers`` rather than the intended helpful assertion message. Set the default to ``None`` so the assert behaves as documented. Adds a regression test in tests/unit/runtime/activation_checkpointing/. Signed-off-by: Kymi808 <zeng.kyle13@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3793254207

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-01T23:20:26Z

+        deepspeed.checkpointing.configure(
+            mpu_=None,
+            partition_activations=True,
+            contiguous_checkpointing=True,


Restore checkpointing state after assert test

When this test runs before any other activation-checkpointing test in the same pytest worker, this configure() call mutates module globals before raising, leaving PARTITION_ACTIVATIONS and CONTIGUOUS_CHECKPOINTING true while num_layers is None. Subsequent ckpt/non_reentrant_checkpoint calls then enter partition_activations(..., contiguous_checkpoint=True) and reach range(num_layers), causing order-dependent failures; please restore the previous checkpointing globals or reset defaults in a finally block after asserting the exception.

Useful? React with 👍 / 👎.

Codex review (deepspeedai#8041) noted that `configure()` mutates module globals (`PARTITION_ACTIVATIONS`, `CONTIGUOUS_CHECKPOINTING`, etc.) before the assertion fires, leaving them set when control unwinds. Subsequent activation-checkpointing tests sharing the same pytest worker then enter `partition_activations(..., contiguous_checkpoint=True)` and hit `range(num_layers)` with `num_layers=None`, causing order-dependent failures — exactly what surfaced as the `modal-torch-latest` CI failure. Snapshot and restore the relevant module globals around the call so the test cleans up after itself. Signed-off-by: Kymi808 <zeng.kyle13@gmail.com>

Kymi808 · 2026-06-01T23:44:01Z

Thanks Codex — addressed in 41f3fd1. The test now snapshots and restores PARTITION_ACTIVATIONS, CONTIGUOUS_CHECKPOINTING, num_layers, CPU_CHECKPOINT, SYNCHRONIZE, PROFILE_TIME, mpu, and deepspeed_checkpointing_enabled around the configure() call via a try/finally, so the order-dependent failure in modal-torch-latest shouldn't recur.

…fault

tohtana · 2026-06-02T21:33:53Z

Thank you @Kymi808 for your contribution.
This looks good to me. Can you fix the formatting error?

Kymi808 requested review from loadams, tjruwase and tohtana as code owners June 1, 2026 23:18

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

Merge branch 'master' into fix/activation-checkpointing-num-layers-de…

ebb4dce

…fault

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

activation_checkpointing: default num_layers to None so configure() assert fires#8041

activation_checkpointing: default num_layers to None so configure() assert fires#8041
Kymi808 wants to merge 3 commits into
deepspeedai:masterfrom
Kymi808:fix/activation-checkpointing-num-layers-default

Kymi808 commented Jun 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Uh oh!

Kymi808 commented Jun 1, 2026

Uh oh!

tohtana commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Kymi808 commented Jun 1, 2026

Summary

Fix

Test

CI/lint

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Kymi808 commented Jun 1, 2026

Uh oh!

tohtana commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants