add flight recorder tutorial #3814

Open
tushar00jain wants to merge 1 commit into pytorch:main from tushar00jain:pr3814

@tushar00jain tushar00jain commented Apr 2, 2026

Summary:
Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.

cc @jiannanWang @d4l3k

pytorch-bot bot commented Apr 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3814

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 23b4344 with merge base 05c08f2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Apr 2, 2026
claude bot commented Apr 2, 2026

Claude finished @sekyondaMeta's task in 2m 12s


Review of PR #3814

  • Read the changed file and gather context
  • Check file format and naming conventions
  • Check for required index.rst / toctree updates
  • Check for thumbnail image
  • Review tutorial content quality and correctness
  • Post review findings

Blocking Issues

1. Wrong file extension — Sphinx will not process this file

The file is named debugging_rank_hang_with_flight_recorder.rst.md, but the Sphinx configuration in conf.py:298 explicitly sets:

source_suffix = ".rst"

There is no MyST-Parser or any markdown extension installed (see conf.py:135-144). Sphinx will completely ignore .rst.md files — the tutorial will not appear in the built site.

The file content is written in Markdown (headers with #, links with [text](url), fenced code blocks with triple backticks). It needs to be either:

  • Option A (Recommended): Converted to RST format and saved as .rst — consistent with all other non-executable tutorials in the repo (e.g., unstable_source/flight_recorder_tutorial.rst).
  • Option B: Add myst_parser to the Sphinx extensions and requirements.txt, and update source_suffix — but this would be a much larger change affecting the whole repo.
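
For reference, Option B would amount to roughly the following change in `conf.py`. This is a minimal sketch, assuming the `myst-parser` package is installed; the starting `extensions` list below is a hypothetical stand-in, not the repo's actual list.

```python
# conf.py (sketch of Option B; not the repo's actual configuration)

# Hypothetical stand-in for the extensions already listed in conf.py.
extensions = [
    "sphinx.ext.autodoc",
]

# MyST-Parser registers a Markdown parser with Sphinx.
extensions.append("myst_parser")

# Map each file suffix to the parser that should handle it.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}
```

With a mapping like this, a file ending in `.rst.md` should be handed to the Markdown parser, because the parser is chosen by filename suffix. Option A avoids the change entirely.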

2. Missing index.rst entries (customcarditem + toctree)

Per the repo's contribution guidelines (CLAUDE.md), every new tutorial must:

  • Add a customcarditem entry in index.rst
  • Add the tutorial to the corresponding toctree in index.rst

Neither of these was done. There is no reference to debugging_rank_hang anywhere in the index files.
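
A check like this one can be reproduced locally with a few lines of Python. The helper below is a hypothetical sketch, not part of the repo's tooling; it reports which index files mention a tutorial slug. The `distributed.rst` candidate is the toctree file suggested elsewhere in this thread.

```python
from pathlib import Path


def files_referencing(slug: str, index_files: list[Path]) -> list[Path]:
    """Return the index files whose text mentions the tutorial slug."""
    hits = []
    for path in index_files:
        # Missing files are skipped rather than treated as errors.
        if path.is_file() and slug in path.read_text(encoding="utf-8"):
            hits.append(path)
    return hits


if __name__ == "__main__":
    # Run from the repository root.
    candidates = [Path("index.rst"), Path("distributed.rst")]
    found = files_referencing("debugging_rank_hang", candidates)
    print("referenced in:", [str(p) for p in found] or "no index file")
```

A plain substring match cannot tell a `customcarditem` from a toctree entry, so an empty result is conclusive but a hit still needs a manual look.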

3. Missing thumbnail image

New tutorials require a square, high-resolution thumbnail image in _static/img/thumbnails/cropped/. No thumbnail was added.


Content & Structural Issues

4. Relationship with existing flight recorder tutorial

There is an existing flight recorder tutorial at unstable_source/flight_recorder_tutorial.rst (already listed in unstable_index.rst). This new tutorial covers overlapping ground (same FR CLI tool, same concept of detecting mismatched collectives). It would be worth clarifying:

  • Is this intended to replace the existing tutorial?
  • Should this be a companion/extension?
  • Should the existing one be updated instead?

If this is meant to be separate, consider at least cross-linking to the existing tutorial and explaining how the two differ (this one focuses on TorchComms debug server dumps vs. the existing one which focuses on the older env-var-based FR configuration).

5. Local path leaked in Reference section (line 297)

(source: `/home/tushar00jain/local/pytorch/docs/source/distributed.md`)

This is a local filesystem path from the author's machine and should be removed.
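
Leaks of this kind are easy to catch before submitting. The following is a minimal sketch of a scanner for home-directory paths; the regular expression and the function name are illustrative assumptions, and the `/Users/` branch for macOS is an addition not drawn from this PR.

```python
import re

# Matches absolute paths under a user's home directory, e.g. /home/<user>/...
# or /Users/<user>/... (the macOS form is an added assumption).
HOME_PATH = re.compile(r"/(?:home|Users)/[\w.-]+(?:/[\w.@+-]+)*")


def find_local_paths(text: str) -> list[tuple[int, str]]:
    """Return (line_number, path) pairs for every home-directory path found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HOME_PATH.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

The pattern can also match URL paths that happen to contain `/home/`, so hits should be reviewed rather than removed automatically.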

6. Directory placement question

The file is placed in intermediate_source/, but the existing flight recorder tutorial is in unstable_source/. If TorchComms and the debug server APIs are still under active development, unstable_source/ may be more appropriate. Worth confirming whether these APIs are considered stable.

7. Demo script (verify_flight_recorder.py) not included

The tutorial references verify_flight_recorder.py as the demo script, but this file is not included in the PR and not provided anywhere in the repo. Readers won't be able to run the demo. Either:

  • Include the script in the PR, or
  • Link to where it lives in the PyTorch repo, or
  • Inline the relevant code in the tutorial (as the existing tutorial does with crash.py)

8. Table of Contents section is redundant

Lines 16-24 manually list a Table of Contents with anchor links. Sphinx (and RST) auto-generate TOCs via the .. contents:: directive. If converted to RST, this manual TOC should be removed in favor of the native directive.


Minor Issues

  • Line 60: Uses the Unicode character ∞ in time.sleep(∞). This is fine for display, but verify that it renders correctly in the built HTML.
  • Line 82-83: Mentions torchcomms and torch.distributed.debug as installable packages with deps tabulate, jinja2, aiohttp. It would be helpful to specify how to install these (pip package names, version requirements).
  • Line 265-266: States "Rank 0 has 5 entries (3 all_reduce + 1 broadcast + the stuck all_reduce)" but the side-by-side output only shows 3 entries for Rank 0 (2 all_reduce + 1 broadcast). Either the example output is truncated or the text count is wrong.
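
The count mismatch in the last bullet matters because the whole workflow rests on comparing per-rank entries. As a toy illustration of that comparison (not the algorithm used by the FR CLI, and with made-up entries rather than the tutorial's output), the first point at which ranks diverge can be found like this:

```python
from itertools import zip_longest


def first_divergence(per_rank_ops: dict[int, list[str]]):
    """Return (index, {rank: op or None}) for the first position where ranks disagree.

    Returns None when every rank recorded the same sequence.
    """
    ranks = sorted(per_rank_ops)
    # zip_longest pads shorter sequences with None, so a rank that never
    # issued a collective shows up as None at that position.
    columns = zip_longest(*(per_rank_ops[r] for r in ranks))
    for index, ops in enumerate(columns):
        if len(set(ops)) > 1:
            return index, dict(zip(ranks, ops))
    return None


# Illustrative entries only; they are not taken from the tutorial's output.
records = {
    0: ["all_reduce", "all_reduce", "broadcast", "all_reduce"],
    1: ["all_reduce", "all_reduce", "broadcast"],
}
print(first_divergence(records))  # (3, {0: 'all_reduce', 1: None})
```

Here rank 1 has no fourth entry, which is the signature of a rank that stalled before joining the collective.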

Summary

The tutorial content itself is well-structured and covers a useful debugging workflow. However, this PR has several blocking issues that prevent it from functioning in the build:

  1. The .rst.md file extension is not recognized by Sphinx — the file must be converted to .rst
  2. Required index.rst entries (card + toctree) are missing
  3. No thumbnail image was added
  4. The demo script referenced in the tutorial is not included

These all need to be addressed before this can be merged.

@svekars svekars added the torchcomms Tutorials about torchcomms label Apr 2, 2026
@@ -0,0 +1,320 @@
# Debugging a Rank Hang with Flight Recorder

Contributor

Thank you for submitting this tutorial! Please make sure to follow the template. The tutorial should be in .rst or .py, not Markdown. Please add it to a toctree (distributed.rst, perhaps) and add a customcarditem in index.rst. We also have a similar tutorial: https://github.com/pytorch/tutorials/blob/main/unstable_source/flight_recorder_tutorial.rst#L4 - can you please take a look and let us know how that tutorial and yours align?

Author

It's a different way to use Flight Recorder: it goes through the debug server instead of calling the APIs explicitly. The tutorial also uses torchcomms instead of PyTorch.

@tushar00jain tushar00jain marked this pull request as ready for review April 2, 2026 17:43
@sekyondaMeta sekyondaMeta added the skip-link-check Will allow you to skip linkcheck on a PR. Should only should be used when a link can't be fixed. label Apr 2, 2026
@sekyondaMeta
Contributor

@claude review these recent updates to the PR

claude bot commented Apr 2, 2026

Claude finished @sekyondaMeta's task in 7m 41s


Review of PR #3814 (Updated)

  • Read the updated changed files and gather context
  • Compare against previous review feedback
  • Check file format and Sphinx compatibility
  • Check index.rst and toctree entries
  • Check for thumbnail image
  • Review tutorial content quality and correctness
  • Check CI status and link checker results
  • Post review findings

Issues Resolved Since Last Review

The PR has been significantly updated and addresses several of the blocking issues from the previous review:

  1. File extension fixed — The file is now intermediate_source/debug_hangs_with_flight_recorder.rst (proper RST format), not .rst.md. The content has been fully converted to RST syntax with proper underline-based headings, .. code-block:: directives, .. list-table:: directives, and RST link syntax. This will be processed by Sphinx correctly.

  2. index.rst customcarditem added — A card entry has been added at index.rst:726-731 with appropriate header, description, image, link, and tags.

  3. distributed.rst toctree entry added — The tutorial is listed in the distributed toctree at distributed.rst:214.

  4. Cross-reference to existing tutorial added — Lines 16-21 include a .. note:: block explaining the relationship to the existing unstable/flight_recorder_tutorial and linking to it via :doc:.

  5. Local filesystem path removed — The leaked /home/tushar00jain/local/pytorch/docs/source/distributed.md path from the previous version is gone.

  6. Demo script inlined — The full verify_flight_recorder.py script is now inlined in the tutorial (lines 89-195) rather than just referenced.

  7. Manual TOC replaced — Now uses the standard .. contents:: RST directive (lines 23-25).


Remaining Issue

Link checker CI failure — The link checker is failing because of http://localhost:25999 in the tutorial text (line 148 in the demo script and line 259 in the expected output). The lychee link checker tries to resolve this URL and fails.

Fix: Add http://localhost:25999 to .lycheeignore. There is precedent for this — the file already ignores http://localhost:6006, http://localhost:8265, http://localhost:8000, and http://127.0.0.1:8000 for similar reasons.
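
To confirm which URLs in a tutorial would still trip the checker, a quick local approximation is possible. This sketch assumes that each line of `.lycheeignore` is treated as a regular expression matched against the URL; the helper name and the URL-extraction pattern are hypothetical, and lychee's own extraction may differ.

```python
import re

# Rough URL matcher for illustration; lychee's own extraction may differ.
URL = re.compile(r"https?://[^\s<>`'\"]+")


def unignored_urls(text: str, ignore_patterns: list[str]) -> list[str]:
    """Return URLs in text that match none of the ignore patterns."""
    compiled = [re.compile(p) for p in ignore_patterns if p.strip()]
    return [
        url for url in URL.findall(text)
        if not any(rx.search(url) for rx in compiled)
    ]


# Entries this review says are already present in .lycheeignore.
existing = [
    "http://localhost:6006",
    "http://localhost:8265",
    "http://localhost:8000",
    "http://127.0.0.1:8000",
]
sample = "Open http://localhost:25999 in a browser."
print(unignored_urls(sample, existing))  # ['http://localhost:25999']
print(unignored_urls(sample, existing + ["http://localhost:25999"]))  # []
```

The first call shows the URL slipping past the existing entries; the second shows it covered once the new line is added.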


Minor Suggestions (non-blocking)

  • Thumbnail image — The card uses generic-pytorch-logo.png as a placeholder, which is fine for now, but a custom thumbnail would improve discoverability on the tutorials landing page.

  • torchcomms installation (line 199) — The note says pip install torchcomms but also mentions torch.distributed.debug as a dependency. It may help to clarify whether torch.distributed.debug requires a minimum PyTorch version or a separate install.

  • Title length — The customcarditem header ("Debugging Hangs with Flight Recorder Using TorchComms and Debug Server") is quite long. Consider shortening to something like "Debug Hangs with TorchComms Flight Recorder" for better card display.


Summary

The PR is in much better shape compared to the first version. The RST conversion, index entries, cross-reference, and inlined demo script address all the previously-identified blocking issues. The only remaining CI failure is the link checker flagging http://localhost:25999, which can be resolved by adding it to .lycheeignore.


Comment on lines +4 to +14
This tutorial walks through diagnosing a **single-rank hang** in a
distributed PyTorch job using the **TorchComms Flight Recorder** and the
**Debug Server**'s periodic dump.

For full API documentation on the debug server, its
endpoints and periodic dumping, see the
`torch.distributed debug HTTP server docs <https://pytorch.org/docs/main/distributed.html#torch-distributed-debug-http-server>`_.

For a reference on the Flight Recorder, see
`Flight Recorder Hook <https://meta-pytorch.org/torchcomms/main/hooks.html#flightrecorderhook>`_
in TorchComms.
Contributor

Suggesting to expand the overview a bit so that the reader has full context:

TorchComms is a Python library that provides a high-level
communicator abstraction over PyTorch's distributed backends (NCCL,
Gloo, etc.). It wraps collective operations with a hook system that
lets you attach instrumentation, such as logging, profiling, or recording,
without modifying application code. For more information, see
`TorchComms Documentation <https://meta-pytorch.org/torchcomms/main/index.html>`_.

The TorchComms Flight Recorder is one such hook
(FlightRecorderHook). It maintains a fixed-size ring buffer that
silently records every collective operation issued through a TorchComms
communicator, capturing the operation type, sequence number, tensor
shapes, execution state, and the Python stack trace at the call site.
For a reference on the Flight Recorder, see
`Flight Recorder Hook <https://meta-pytorch.org/torchcomms/main/hooks.html#flightrecorderhook>`_
in TorchComms.

The Debug Server (torch.distributed.debug) runs an HTTP server
on each rank that can periodically dump Flight Recorder snapshots and
Python stack traces to disk. For full API documentation on the debug server, its
endpoints and periodic dumping, see the
`torch.distributed debug HTTP server docs <https://pytorch.org/docs/main/distributed.html#torch-distributed-debug-http-server>`_.

This tutorial walks you through a concrete example: a two-rank job where
one rank hangs before a collective, and shows how to use the Flight
Recorder and Debug Server to diagnose it step by step.

.. contents:: Table of Contents
   :local:
   :depth: 2
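
The fixed-size ring buffer described in the suggested overview can be illustrated with a toy recorder. This is a minimal sketch built on `collections.deque`; the `Record` and `ToyRecorder` names are hypothetical and this is not how `FlightRecorderHook` is implemented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Record:
    seq: int
    op: str
    state: str = "scheduled"


class ToyRecorder:
    """Keeps only the most recent max_entries records."""

    def __init__(self, max_entries: int) -> None:
        self._buffer = deque(maxlen=max_entries)
        self._next_seq = 0

    def record(self, op: str) -> Record:
        entry = Record(seq=self._next_seq, op=op)
        self._next_seq += 1
        self._buffer.append(entry)  # the oldest entry is dropped when full
        return entry

    def dump(self) -> list[Record]:
        return list(self._buffer)


recorder = ToyRecorder(max_entries=3)
for op in ["all_reduce", "all_reduce", "broadcast", "all_reduce"]:
    recorder.record(op).state = "completed"
print([(r.seq, r.op) for r in recorder.dump()])
# [(1, 'all_reduce'), (2, 'broadcast'), (3, 'all_reduce')]
```

Because the buffer is bounded, the earliest operation (sequence number 0) has already been evicted by the time of the dump, which is why the sequence number is worth recording alongside each entry.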

Contributor

Add Prerequisites and What You Will Learn here

Comment on lines +23 to +25
.. contents:: Table of Contents
   :local:
   :depth: 2
Contributor

we typically don't use this

:depth: 2


Background
Contributor

Suggested change: rename the heading "Background" to "Flight Recorder Overview".

Comment on lines +207 to +213
Prerequisites
^^^^^^^^^^^^^

* PyTorch 2.5 or later
* ``torchcomms`` installed (``pip install torchcomms``)
* Use ``TEST_BACKEND=gloo TEST_DEVICE=cpu`` for CPU-only testing, or
  a CUDA host with 2 or more GPUs.
Contributor

Move up to the top.

python -m torch.distributed.flight_recorder.fr_trace <dir> -p <prefix> -j --print_stack_trace

# Best-effort when some rank dumps are missing:
python -m torch.distributed.flight_recorder.fr_trace <dir> -p <prefix> --allow-incomplete-ranks
Contributor

Need to add Conclusion

Conclusion

In this tutorial, you have learned how to use the TorchComms Flight
Recorder and the Debug Server to diagnose a single-rank hang in a
distributed PyTorch job. By examining the aggregated text dumps, per-rank
pickle traces, and stack trace snapshots, you identified which collective
was stuck, which rank failed to participate, and the exact line of code
responsible for the hang. You can apply this same workflow to debug real-world
distributed training hangs — replace the simulated time.sleep with
whatever your job is actually stuck on, and the Flight Recorder will show
you where ranks diverged.

See Also:

  • Link
  • Link

Contributor

svekars commented Apr 2, 2026
