add flight recorder tutorial by tushar00jain · Pull Request #3814 · pytorch/tutorials

tushar00jain · 2026-04-02T02:35:10Z

Summary:
Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.

cc @jiannanWang @d4l3k

pytorch-bot · 2026-04-02T02:35:14Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3814

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 23b4344 with merge base 05c08f2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

claude · 2026-04-02T14:14:35Z

Claude finished @sekyondaMeta's task in 2m 12s

Review of PR #3814

Read the changed file and gather context
Check file format and naming conventions
Check for required index.rst / toctree updates
Check for thumbnail image
Review tutorial content quality and correctness
Post review findings

Blocking Issues

1. Wrong file extension — Sphinx will not process this file

The file is named debugging_rank_hang_with_flight_recorder.rst.md, but the Sphinx configuration in conf.py:298 explicitly sets:

source_suffix = ".rst"

There is no MyST-Parser or any markdown extension installed (see conf.py:135-144). Sphinx will completely ignore .rst.md files — the tutorial will not appear in the built site.

The file content is written in Markdown (headers with #, links with [text](url), fenced code blocks with triple backticks). It needs to be either:

Option A (Recommended): Converted to RST format and saved as .rst — consistent with all other non-executable tutorials in the repo (e.g., unstable_source/flight_recorder_tutorial.rst).
Option B: Add myst_parser to the Sphinx extensions and requirements.txt, and update source_suffix — but this would be a much larger change affecting the whole repo.
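To illustrate why the current name is skipped, here is a minimal sketch of suffix matching. It is a simplified model, not Sphinx's actual code: the helper `matches_source_suffix` is hypothetical, and the Option B mapping is an assumed shape for a `source_suffix` setting that would also require `myst_parser` in `extensions`.

```python
def matches_source_suffix(filename: str, source_suffix) -> bool:
    """Return True if the filename ends with one of the configured suffixes.

    Hypothetical helper: models source discovery as a plain suffix check.
    """
    if isinstance(source_suffix, str):
        suffixes = [source_suffix]
    else:
        # A dict maps suffix -> parser name; iterating it yields the suffixes.
        suffixes = list(source_suffix)
    return any(filename.endswith(suffix) for suffix in suffixes)


# Setting quoted above from conf.py: only ".rst" is recognized.
current = ".rst"

# Assumed Option B mapping (not taken from the repo).
option_b = {".rst": "restructuredtext", ".md": "markdown"}

name = "debugging_rank_hang_with_flight_recorder.rst.md"
print(matches_source_suffix(name, current))   # the name ends in ".md", not ".rst"
print(matches_source_suffix(name, option_b))  # ".md" is now a known suffix
```

Under this model, renaming the file to end in `.rst` (Option A) is the smallest change that makes it discoverable.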

2. Missing index.rst entries (customcarditem + toctree)

Per the repo's contribution guidelines (CLAUDE.md), every new tutorial must:

Add a customcarditem entry in index.rst
Add the tutorial to the corresponding toctree in index.rst

Neither of these was done. There is no reference to debugging_rank_hang anywhere in the index files.
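A check like this can be reproduced locally with a short script. The sketch below is only an illustration: `files_mentioning` is a hypothetical helper, and the list of index files to scan is an assumption about which files matter for this tutorial.

```python
from pathlib import Path


def files_mentioning(name: str, index_files) -> list:
    """Return the paths from index_files whose text contains `name`."""
    hits = []
    for path in map(Path, index_files):
        # Skip files that do not exist so the check can run from any directory.
        if path.is_file() and name in path.read_text(encoding="utf-8"):
            hits.append(str(path))
    return hits


if __name__ == "__main__":
    # Assumed file list; adjust to the index files the tutorial belongs in.
    print(files_mentioning("debugging_rank_hang", ["index.rst", "distributed.rst"]))
```

An empty result means neither a card nor a toctree entry references the tutorial.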

3. Missing thumbnail image

New tutorials require a square, high-resolution thumbnail image in _static/img/thumbnails/cropped/. No thumbnail was added.

Content & Structural Issues

4. Relationship with existing flight recorder tutorial

There is an existing flight recorder tutorial at unstable_source/flight_recorder_tutorial.rst (already listed in unstable_index.rst). This new tutorial covers overlapping ground (same FR CLI tool, same concept of detecting mismatched collectives). It would be worth clarifying:

Is this intended to replace the existing tutorial?
Should this be a companion/extension?
Should the existing one be updated instead?

If this is meant to be separate, consider at least cross-linking to the existing tutorial and explaining how the two differ (this one focuses on TorchComms debug server dumps vs. the existing one which focuses on the older env-var-based FR configuration).

5. Local path leaked in Reference section (line 297)

(source: `/home/tushar00jain/local/pytorch/docs/source/distributed.md`)

This is a local filesystem path from the author's machine and should be removed.
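Leaks like this are easy to catch with a quick scan before submitting. The following is a rough heuristic sketch, not an existing repo check: the pattern and the helper name `find_local_paths` are assumptions, and it only looks for `/home/<user>/...` and `/Users/<user>/...` style paths.

```python
import re

# Heuristic: a home-directory prefix, a user name, then the rest of the path
# up to whitespace, a backtick, or a closing parenthesis.
HOME_PATH = re.compile(r"(?:/home|/Users)/[A-Za-z0-9._-]+(?:/[^\s`)]*)?")


def find_local_paths(text: str) -> list:
    """Return (line_number, path) pairs for every suspected local path."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HOME_PATH.finditer(line):
            found.append((lineno, match.group(0)))
    return found


sample = "(source: `/home/tushar00jain/local/pytorch/docs/source/distributed.md`)"
for lineno, path in find_local_paths(sample):
    print(f"line {lineno}: {path}")
```

The pattern can also match URLs that happen to contain `/home/`, so any hit should be reviewed by hand rather than removed automatically.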

6. Directory placement question

The file is placed in intermediate_source/, but the existing flight recorder tutorial is in unstable_source/. If TorchComms and the debug server APIs are still under active development, unstable_source/ may be more appropriate. Worth confirming whether these APIs are considered stable.

7. Demo script (verify_flight_recorder.py) not included

The tutorial references verify_flight_recorder.py as the demo script, but this file is not included in the PR and not provided anywhere in the repo. Readers won't be able to run the demo. Either:

Include the script in the PR, or
Link to where it lives in the PyTorch repo, or
Inline the relevant code in the tutorial (as the existing tutorial does with crash.py)

8. Table of Contents section is redundant

Lines 16-24 manually list a Table of Contents with anchor links. Sphinx (and RST) auto-generate TOCs via the .. contents:: directive. If converted to RST, this manual TOC should be removed in favor of the native directive.

Minor Issues

Line 60: Uses the Unicode character ∞ in time.sleep(∞). This is fine for display, but verify that it renders correctly in the built HTML.
Lines 82-83: Mention torchcomms and torch.distributed.debug as installable packages with the dependencies tabulate, jinja2, and aiohttp. It would help to specify how to install these (pip package names, version requirements).
Lines 265-266: State "Rank 0 has 5 entries (3 all_reduce + 1 broadcast + the stuck all_reduce)", but the side-by-side output shows only 3 entries for Rank 0 (2 all_reduce + 1 broadcast). Either the example output is truncated or the count in the text is wrong.

Summary

The tutorial content itself is well-structured and covers a useful debugging workflow. However, this PR has several blocking issues that prevent it from functioning in the build:

The .rst.md file extension is not recognized by Sphinx — the file must be converted to .rst
Required index.rst entries (card + toctree) are missing
No thumbnail image was added
The demo script referenced in the tutorial is not included

These all need to be addressed before this can be merged.

svekars · 2026-04-02T14:22:55Z

intermediate_source/debugging_rank_hang_with_flight_recorder.rst.md

@@ -0,0 +1,320 @@
+# Debugging a Rank Hang with Flight Recorder
+


Thank you for submitting this tutorial! Please make sure to follow the template. The tutorial should be in .rst or .py, not Markdown. Please add it to a toctree (distributed.rst, perhaps), and index.rst needs a customcarditem. We also have a similar tutorial: https://github.com/pytorch/tutorials/blob/main/unstable_source/flight_recorder_tutorial.rst#L4 - can you please take a look and let us know how this tutorial and yours align?

It's a different way to use Flight Recorder: through the debug server, without explicitly calling the APIs. The tutorial uses torchcomms instead of PyTorch.

sekyondaMeta · 2026-04-02T19:50:18Z

@claude review these recent updates to the PR

claude · 2026-04-02T19:50:46Z

Claude finished @sekyondaMeta's task in 7m 41s

Review of PR #3814 (Updated)

Read the updated changed files and gather context
Compare against previous review feedback
Check file format and Sphinx compatibility
Check index.rst and toctree entries
Check for thumbnail image
Review tutorial content quality and correctness
Check CI status and link checker results
Post review findings

Issues Resolved Since Last Review

The PR has been significantly updated and addresses several of the blocking issues from the previous review:

File extension fixed — The file is now intermediate_source/debug_hangs_with_flight_recorder.rst (proper RST format), not .rst.md. The content has been fully converted to RST syntax with proper underline-based headings, .. code-block:: directives, .. list-table:: directives, and RST link syntax. This will be processed by Sphinx correctly.
index.rst customcarditem added — A card entry has been added at index.rst:726-731 with appropriate header, description, image, link, and tags.
distributed.rst toctree entry added — The tutorial is listed in the distributed toctree at distributed.rst:214.
Cross-reference to existing tutorial added — Lines 16-21 include a .. note:: block explaining the relationship to the existing unstable/flight_recorder_tutorial and linking to it via :doc:.
Local filesystem path removed — The leaked /home/tushar00jain/local/pytorch/docs/source/distributed.md path from the previous version is gone.
Demo script inlined — The full verify_flight_recorder.py script is now inlined in the tutorial (lines 89-195) rather than just referenced.
Manual TOC replaced — Now uses the standard .. contents:: RST directive (lines 23-25).

Remaining Issue

Link checker CI failure — The link checker is failing because of http://localhost:25999 in the tutorial text (line 148 in the demo script and line 259 in the expected output). The lychee link checker tries to resolve this URL and fails.

Fix: Add http://localhost:25999 to .lycheeignore. There is precedent for this — the file already ignores http://localhost:6006, http://localhost:8265, http://localhost:8000, and http://127.0.0.1:8000 for similar reasons.
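The effect of the fix can be sketched as follows, assuming each line of .lycheeignore is treated as a regular expression matched against the URL. The helper `is_ignored` is hypothetical and only models that assumed behavior; the existing entries are the ones listed above.

```python
import re


def is_ignored(url: str, patterns) -> bool:
    """Return True if any non-blank pattern matches somewhere in the URL."""
    return any(re.search(p.strip(), url) for p in patterns if p.strip())


# Entries the review says are already present in .lycheeignore.
existing = [
    "http://localhost:6006",
    "http://localhost:8265",
    "http://localhost:8000",
    "http://127.0.0.1:8000",
]

url = "http://localhost:25999"
print(is_ignored(url, existing))          # not covered by the existing entries
print(is_ignored(url, existing + [url]))  # covered once the new entry is added
```

A broader pattern covering every localhost port would also work under this assumption, but adding the single URL matches the existing convention in the file.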

Minor Suggestions (non-blocking)

Thumbnail image — The card uses generic-pytorch-logo.png as a placeholder, which is fine for now, but a custom thumbnail would improve discoverability on the tutorials landing page.
torchcomms installation (line 199) — The note says pip install torchcomms but also mentions torch.distributed.debug as a dependency. It may help to clarify whether torch.distributed.debug requires a minimum PyTorch version or a separate install.
Title length — The customcarditem header ("Debugging Hangs with Flight Recorder Using TorchComms and Debug Server") is quite long. Consider shortening to something like "Debug Hangs with TorchComms Flight Recorder" for better card display.

Summary

The PR is in much better shape compared to the first version. The RST conversion, index entries, cross-reference, and inlined demo script address all the previously-identified blocking issues. The only remaining CI failure is the link checker flagging http://localhost:25999, which can be resolved by adding it to .lycheeignore.

svekars · 2026-04-02T20:50:07Z