
Adding Support 4 Nodes AllReduce Small Message Size#794

Open
caiomcbr wants to merge 5 commits into main from caiorocha/4_nodes_allreduce

Conversation

@caiomcbr
Contributor

Results on 4 H200 nodes:

| Size | NCCL   | MSCCL++ 57TB | MSCCL++ 29TB |
|------|-------:|-------------:|-------------:|
| 8K   | 45.75  | 17.74        | 18.18        |
| 16K  | 47.08  | 18.9         | 18.42        |
| 32K  | 47.29  | 19.48        | 19.12        |
| 64K  | 50.34  | 20.51        | 19.29        |
| 128K | 59.65  | 21.37        | 20.25        |
| 256K | 87.46  | 23.87        | 23.51        |
| 512K | 106.55 | 29.15        | 29.51        |
| 1M   | 115    | 40.64        | 41.83        |
| 2M   | 135.89 | 63.73        | 70.45        |
| 4M   | 177.59 | 121.76       | 128.79       |
| 8M   | 251.17 | 228.5        | 251.36       |

@Binyang2014
Contributor

@copilot Please help review this PR.

Contributor

Copilot AI commented May 3, 2026

> @copilot Please help review this PR.

Here is the review summary for this PR (commit b30752a fixes the issues found):

Algorithm correctness. The hierarchical approach is sound (a plain-Python sketch of the data flow follows this list):

  1. Intra-node reduce-scatter: each GPU sends every peer's data chunk to that peer's scratch buffer, then reduces its own assigned chunk from the received packets.
  2. Inter-node allreduce: each GPU sends its node-level reduced chunk to every corresponding GPU on other nodes (one put_packets per peer node, using tb=0), then reduces across all received contributions.
  3. Intra-node broadcast: each GPU broadcasts its fully reduced chunk to all intra-node peers via read_put_packets, then unpacks the received data.
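
To make the data flow concrete, here is a plain-Python simulation of the three phases. It models only the reductions, not the DSL's put_packets / read_put_packets traffic; the 4 × 8 topology matches the PR's target configuration, and the one-value-per-chunk layout is a simplification for illustration:

```python
import random

num_nodes, gpus_per_node = 4, 8
total_gpus = num_nodes * gpus_per_node

def node_of(g):
    return g // gpus_per_node

def local_rank(g):
    return g % gpus_per_node

# Each GPU holds one value per intra-node chunk (one chunk per local rank).
data = [[random.randint(0, 9) for _ in range(gpus_per_node)] for _ in range(total_gpus)]
expected = [sum(data[g][c] for g in range(total_gpus)) for c in range(gpus_per_node)]

# Phase 1: intra-node reduce-scatter. GPU g reduces chunk local_rank(g) across its node.
node_partial = [
    sum(data[node_of(g) * gpus_per_node + p][local_rank(g)] for p in range(gpus_per_node))
    for g in range(total_gpus)
]

# Phase 2: inter-node allreduce. GPUs sharing a local rank combine their node partials.
fully_reduced = [
    sum(node_partial[n * gpus_per_node + local_rank(g)] for n in range(num_nodes))
    for g in range(total_gpus)
]

# Phase 3: intra-node broadcast. Each GPU collects every intra-node peer's reduced chunk.
result = [
    [fully_reduced[node_of(g) * gpus_per_node + c] for c in range(gpus_per_node)]
    for g in range(total_gpus)
]

assert all(row == expected for row in result)
```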

The scratch buffer layout is correct: 2 * total_gpus + packets_per_gpu * num_nodes bytes per GPU covers all three regions (intra-node scatter, inter-node exchange, broadcast).
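
A minimal sketch of how that size decomposes. The region ordering and per-region purposes are assumptions based on the summary above, not the PR's actual layout code, and packets_per_gpu is a placeholder value:

```python
num_nodes, gpus_per_node, packets_per_gpu = 4, 8, 2  # packets_per_gpu is a placeholder
total_gpus = num_nodes * gpus_per_node

# Assumed region breakdown; only the total matches the PR's formula.
intra_scatter_region = total_gpus               # landing slots for intra-node reduce-scatter
inter_exchange_region = total_gpus              # slots for node-level partials from all GPUs
broadcast_region = packets_per_gpu * num_nodes  # fully reduced chunks for the final broadcast

scratch_buffer_size = 2 * total_gpus + packets_per_gpu * num_nodes
assert intra_scatter_region + inter_exchange_region + broadcast_region == scratch_buffer_size
```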

Issues fixed (b30752a):

  • Stale module docstring in allreduce_multi_nodes.py was identical to allreduce_2nodes.py; updated to describe the generalized N-node steps accurately.
  • Missing trailing comma on the last entry of default_algo_configs in __main__.py.
  • import argparse moved inside the if __name__ == "__main__" block (it is not needed at module-import time); see the sketch after this list.
  • Pre-existing filename mismatch in algorithm_collection_builder.cc: it was looking for allreduce_2nodes_64K_2M.json while __main__.py generates allreduce_2nodes_128K_2M.json, so that plan was never loaded. Fixed to allreduce_2nodes_128K_2M.json.
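
For reference, the relocated import looks roughly like this; the parser setup shown is a placeholder, and the real argument definitions in __main__.py are elided:

```python
# Module-level code no longer imports argparse.

if __name__ == "__main__":
    import argparse  # deferred: only needed when the module runs as a script

    parser = argparse.ArgumentParser()
    # ... real argument definitions elided ...
    args = parser.parse_args()
```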

Minor observations (no change needed, same pattern as allreduce_2nodes.py):

  • The loop variable i is reused inside the list comprehension; this is valid Python 3 (comprehensions have their own scope, as the demo after this list shows) and matches the existing style.
  • The algorithm is general for any number of nodes but is currently only registered for 4 nodes (32 GPUs × 8 per node, 1K–8M). New size ranges can be added by registering additional AlgoSpec entries.
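
The comprehension-scope point can be verified directly; in Python 3 the comprehension body runs in its own scope, so the inner i never leaks:

```python
for i in range(3):
    squares = [i * i for i in range(5)]  # inner i shadows the loop variable
    assert i in (0, 1, 2)                # outer i is untouched by the comprehension

print(squares)  # [0, 1, 4, 9, 16]
```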

Copilot AI requested a review from Binyang2014 May 3, 2026 22:52
Binyang2014 requested a review from a team May 4, 2026 20:35
```python
)
)

scratch_buffer_size = 2 * total_gpus + packets_per_gpu * num_nodes
```
Contributor


Please add a comment here explaining how the scratch buffer size is decided.
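
One possible comment, based on the region breakdown in the review summary above; the breakdown itself is an assumption, not confirmed by the PR:

```python
# Scratch slots per GPU, split into three regions (assumed breakdown):
#   total_gpus                  -> intra-node reduce-scatter landing zone
#   total_gpus                  -> inter-node exchange of node-level partials
#   packets_per_gpu * num_nodes -> broadcast of fully reduced chunks
scratch_buffer_size = 2 * total_gpus + packets_per_gpu * num_nodes
```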
