Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3814
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 23b4344 with merge base 05c08f2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Claude finished @sekyondaMeta's task in 2m 12s

Review of PR #3814
Blocking Issues

1. Wrong file extension: Sphinx will not process this file. The Sphinx configuration sets source_suffix = ".rst" and no MyST-Parser or other markdown extension is installed, but the file content is written in Markdown.
2. Missing index entries: per the repo's contribution guidelines, a new tutorial must be added to a toctree and given a customcarditem entry. Neither of these was done; there is no reference to the new file in the build configuration.

3. Missing thumbnail image: new tutorials require a square, high-resolution thumbnail image.

Content & Structural Issues

4. Relationship with existing flight recorder tutorial: there is an existing flight recorder tutorial at unstable_source/flight_recorder_tutorial.rst. If this is meant to be separate, consider at least cross-linking to the existing tutorial and explaining how the two differ (this one focuses on TorchComms debug server dumps vs. the existing one, which focuses on the older env-var-based FR configuration).

5. Local path leaked in Reference section (line 297): this is a local filesystem path from the author's machine and should be removed.

6. Directory placement question: the directory the file is placed in should be confirmed against the repo's conventions.

7. Demo script: the tutorial references a demo script that is not included in the PR.

8. Table of Contents section is redundant: lines 16-24 manually list a Table of Contents with anchor links, but Sphinx (and RST) auto-generate TOCs via the contents directive.

Minor Issues
Summary

The tutorial content itself is well-structured and covers a useful debugging workflow. However, this PR has several blocking issues that prevent it from functioning in the build:
These all need to be addressed before this can be merged.
@@ -0,0 +1,320 @@

# Debugging a Rank Hang with Flight Recorder
Thank you for submitting this tutorial! Please make sure to follow the template. The tutorial should be in .rst or .py, not markdown. Please add it to a toctree (distributed.rst perhaps), and in index.rst you may need to add a customcarditem. We also have a tutorial like this: https://github.com/pytorch/tutorials/blob/main/unstable_source/flight_recorder_tutorial.rst#L4 - can you please take a look and let us know how this tutorial and yours align?
It's a different way to use flight recorder without explicitly calling the APIs, by using the debug server. The tutorial shows using torchcomms instead of pytorch.
Summary: Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.
@claude review these recent updates to the PR
Claude finished @sekyondaMeta's task in 7m 41s

Review of PR #3814 (Updated)
Issues Resolved Since Last Review

The PR has been significantly updated and addresses several of the blocking issues from the previous review:
Remaining Issue

Link checker CI failure: the link checker is failing on one of the tutorial's links. Fix: add the flagged URL to the link checker's ignore list.

Minor Suggestions (non-blocking)
Summary

The PR is in much better shape compared to the first version. The RST conversion, index entries, cross-reference, and inlined demo script address all the previously identified blocking issues. The only remaining CI failure is the link checker flagging a link in the tutorial.
This tutorial walks through diagnosing a **single-rank hang** in a
distributed PyTorch job using the **TorchComms Flight Recorder** and the
**Debug Server**'s periodic dump.

For full API documentation on the debug server, its
endpoints and periodic dumping, see the
`torch.distributed debug HTTP server docs <https://pytorch.org/docs/main/distributed.html#torch-distributed-debug-http-server>`_.

For a reference on the Flight Recorder, see
`Flight Recorder Hook <https://meta-pytorch.org/torchcomms/main/hooks.html#flightrecorderhook>`_
in TorchComms.
Suggesting to expand the overview a bit so that the reader has full context:
TorchComms is a Python library that provides a high-level
communicator abstraction over PyTorch's distributed backends (NCCL,
Gloo, etc.). It wraps collective operations with a hook system that
lets you attach instrumentation, such as logging, profiling, or recording,
without modifying application code. For more information, see the
`TorchComms Documentation <https://meta-pytorch.org/torchcomms/main/index.html>`_.

The TorchComms Flight Recorder is one such hook
(FlightRecorderHook). It maintains a fixed-size ring buffer that
silently records every collective operation issued through a TorchComms
communicator, capturing the operation type, sequence number, tensor
shapes, execution state, and the Python stack trace at the call site. For a reference on the Flight Recorder, see
`Flight Recorder Hook <https://meta-pytorch.org/torchcomms/main/hooks.html#flightrecorderhook>`_
in TorchComms.

The Debug Server (torch.distributed.debug) runs an HTTP server
on each rank that periodically dumps Flight Recorder snapshots and
Python stack traces to disk. For full API documentation on the debug server, its
endpoints and periodic dumping, see the
`torch.distributed debug HTTP server docs <https://pytorch.org/docs/main/distributed.html#torch-distributed-debug-http-server>`_.

This tutorial walks you through a concrete example: a two-rank job where
one rank hangs before a collective, and shows how to use the Flight
Recorder and Debug Server to diagnose it step by step.
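The fixed-size ring buffer described above can be sketched in plain Python. This is a conceptual illustration only, not the actual torchcomms implementation; the names ``RingRecorder`` and ``CollectiveRecord`` are hypothetical stand-ins:

```python
import traceback
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CollectiveRecord:
    # Metadata a flight recorder conceptually keeps per collective
    # (names here are illustrative, not the torchcomms schema).
    op_type: str
    seq: int
    shapes: list
    state: str                      # e.g. "scheduled", "started", "completed"
    stack: list = field(default_factory=list)


class RingRecorder:
    """Hypothetical stand-in for a flight-recorder hook: a fixed-size
    ring buffer that records every collective issued through it."""

    def __init__(self, capacity=8):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off
        self.seq = 0

    def record(self, op_type, shapes, state="completed"):
        self.seq += 1
        self.buffer.append(CollectiveRecord(
            op_type=op_type,
            seq=self.seq,
            shapes=shapes,
            state=state,
            stack=traceback.format_stack(limit=3),  # call-site trace
        ))


rec = RingRecorder(capacity=2)
rec.record("allreduce", [(4, 4)])
rec.record("broadcast", [(8,)])
rec.record("allreduce", [(4, 4)])   # evicts the oldest entry
print([r.seq for r in rec.buffer])  # → [2, 3]
```

The point of the fixed capacity is that recording stays cheap and silent during healthy runs; only the most recent window of collectives is retained for post-mortem inspection.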
.. contents:: Table of Contents
   :local:
   :depth: 2
Add Prerequisites and What You Will Learn here
.. contents:: Table of Contents
   :local:
   :depth: 2
we typically don't use this
   :depth: 2

Background
Suggested change:
- Background
+ Flight Recorder Overview
Prerequisites
^^^^^^^^^^^^^

* PyTorch 2.5 or later
* ``torchcomms`` installed (``pip install torchcomms``)
* Use ``TEST_BACKEND=gloo TEST_DEVICE=cpu`` for CPU-only testing, or
  a CUDA host with 2 or more GPUs.
python -m torch.distributed.flight_recorder.fr_trace <dir> -p <prefix> -j --print_stack_trace

# Best-effort when some rank dumps are missing:
python -m torch.distributed.flight_recorder.fr_trace <dir> -p <prefix> --allow-incomplete-ranks
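The core of this kind of cross-rank analysis is comparing how far each rank progressed. A minimal sketch of that comparison, under an assumed (hypothetical) per-rank record format of ``(seq, op, state)`` tuples rather than the tool's real dump schema:

```python
def find_stuck_collective(rank_traces):
    """Given per-rank lists of (seq, op, state) records, report the
    frontier sequence number and which ranks never reached it, i.e.
    which ranks failed to join the in-flight collective.
    The input format is a hypothetical simplification."""
    # Last sequence number each rank made any progress on.
    last_seq = {rank: max((seq for seq, _, _ in trace), default=0)
                for rank, trace in rank_traces.items()}
    frontier = max(last_seq.values())
    stragglers = [rank for rank, seq in last_seq.items() if seq < frontier]
    return frontier, stragglers


# Rank 0 issued collective #2, but rank 1 never did: rank 1 is the hang.
traces = {
    0: [(1, "allreduce", "completed"), (2, "allreduce", "started")],
    1: [(1, "allreduce", "completed")],
}
seq, stuck = find_stuck_collective(traces)
print(seq, stuck)  # → 2 [1]
```

Once the straggler rank is known, its stack-trace snapshot (from the debug server dump) shows the exact line where that rank is stuck.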
Need to add Conclusion
Conclusion
In this tutorial, you have learned how to use the TorchComms Flight
Recorder and the Debug Server to diagnose a single-rank hang in a
distributed PyTorch job. By examining the aggregated text dumps, per-rank
pickle traces, and stack trace snapshots, you identified which collective
was stuck, which rank failed to participate, and the exact line of code
responsible for the hang. You can apply this same workflow to debug real-world
distributed training hangs: replace the simulated time.sleep with
whatever your job is actually stuck on, and the Flight Recorder will show
you where ranks diverged.
See Also:

- Link
- Link
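The failure mode the tutorial simulates can be reproduced backend-free with stdlib threading, where a ``Barrier`` stands in for a collective. This is an analogy only (no torch or torchcomms involved): one participant sleeps past the barrier timeout, and the healthy side observes the broken barrier the same way a stuck collective manifests as a timeout.

```python
import threading
import time

# A 2-party barrier stands in for a 2-rank collective; this is an
# analogy, not how TorchComms collectives are implemented.
barrier = threading.Barrier(parties=2)
result = {}


def rank0():
    try:
        barrier.wait(timeout=0.5)   # healthy rank reaches the collective
        result[0] = "completed"
    except threading.BrokenBarrierError:
        result[0] = "timed out"     # peer never arrived in time


def rank1():
    time.sleep(2.0)                 # simulated hang before the collective
    result[1] = "arrived too late"


t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()
print(result[0])  # → timed out
```

This mirrors the diagnostic signal in the tutorial: the healthy rank times out inside the collective, while the hung rank's stack trace shows it stuck before ever reaching it.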
cc @jiannanWang @d4l3k