
Adding Support 4 Nodes AllReduce Small Message Size#794

Open
caiomcbr wants to merge 5 commits into main from caiorocha/4_nodes_allreduce

Conversation

@caiomcbr
Contributor

Results on 4 H200 nodes:

| Size | NCCL   | MSCCL++ 57TB | MSCCL++ 29TB |
|------|-------:|-------------:|-------------:|
| 8K   | 45.75  | 17.74        | 18.18        |
| 16K  | 47.08  | 18.9         | 18.42        |
| 32K  | 47.29  | 19.48        | 19.12        |
| 64K  | 50.34  | 20.51        | 19.29        |
| 128K | 59.65  | 21.37        | 20.25        |
| 256K | 87.46  | 23.87        | 23.51        |
| 512K | 106.55 | 29.15        | 29.51        |
| 1M   | 115    | 40.64        | 41.83        |
| 2M   | 135.89 | 63.73        | 70.45        |
| 4M   | 177.59 | 121.76       | 128.79       |
| 8M   | 251.17 | 228.5        | 251.36       |

@Binyang2014
Contributor

@copilot Please help review this PR.

Contributor

Copilot AI commented May 3, 2026

> @copilot Please help review this PR.

Here is the review summary for this PR (commit b30752a fixes the issues found):

Algorithm correctness. The hierarchical approach is sound (a plain-Python sketch of the data flow follows this list):

  1. Intra-node reduce-scatter: each GPU sends every peer's data chunk to that peer's scratch buffer, then reduces its own assigned chunk from the received packets.
  2. Inter-node allreduce: each GPU sends its node-level reduced chunk to every corresponding GPU on other nodes (one put_packets per peer node, using tb=0), then reduces across all received contributions.
  3. Intra-node broadcast: each GPU broadcasts its fully reduced chunk to all intra-node peers via read_put_packets, then unpacks the received data.
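
To make the data flow concrete, here is a plain-Python simulation of the three phases. It models only the reductions, not the DSL's put_packets / read_put_packets traffic; the 4 × 8 topology matches the PR's target configuration, and the one-value-per-chunk layout is a simplification for illustration:

```python
import random

num_nodes, gpus_per_node = 4, 8
total_gpus = num_nodes * gpus_per_node

def node_of(g):
    return g // gpus_per_node

def local_rank(g):
    return g % gpus_per_node

# Each GPU holds one value per intra-node chunk (one chunk per local rank).
data = [[random.randint(0, 9) for _ in range(gpus_per_node)] for _ in range(total_gpus)]
expected = [sum(data[g][c] for g in range(total_gpus)) for c in range(gpus_per_node)]

# Phase 1: intra-node reduce-scatter. GPU g reduces chunk local_rank(g) across its node.
node_partial = [
    sum(data[node_of(g) * gpus_per_node + p][local_rank(g)] for p in range(gpus_per_node))
    for g in range(total_gpus)
]

# Phase 2: inter-node allreduce. GPUs sharing a local rank combine their node partials.
fully_reduced = [
    sum(node_partial[n * gpus_per_node + local_rank(g)] for n in range(num_nodes))
    for g in range(total_gpus)
]

# Phase 3: intra-node broadcast. Each GPU collects every intra-node peer's reduced chunk.
result = [
    [fully_reduced[node_of(g) * gpus_per_node + c] for c in range(gpus_per_node)]
    for g in range(total_gpus)
]

assert all(row == expected for row in result)
```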

The scratch buffer layout is correct: 2 * total_gpus + packets_per_gpu * num_nodes bytes per GPU covers all three regions (intra-node scatter, inter-node exchange, broadcast).
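
A minimal sketch of how that size decomposes. The region ordering and per-region purposes are assumptions based on the summary above, not the PR's actual layout code, and packets_per_gpu is a placeholder value:

```python
num_nodes, gpus_per_node, packets_per_gpu = 4, 8, 2  # packets_per_gpu is a placeholder
total_gpus = num_nodes * gpus_per_node

# Assumed region breakdown; only the total matches the PR's formula.
intra_scatter_region = total_gpus               # landing slots for intra-node reduce-scatter
inter_exchange_region = total_gpus              # slots for node-level partials from all GPUs
broadcast_region = packets_per_gpu * num_nodes  # fully reduced chunks for the final broadcast

scratch_buffer_size = 2 * total_gpus + packets_per_gpu * num_nodes
assert intra_scatter_region + inter_exchange_region + broadcast_region == scratch_buffer_size
```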

Issues fixed (b30752a):

  • Stale module docstring in allreduce_multi_nodes.py was identical to allreduce_2nodes.py; updated to describe the generalized N-node steps accurately.
  • Missing trailing comma on the last entry of default_algo_configs in __main__.py.
  • import argparse moved inside the if __name__ == "__main__" block (it is not needed at module-import time); see the sketch after this list.
  • Pre-existing filename mismatch in algorithm_collection_builder.cc: it was looking for allreduce_2nodes_64K_2M.json while __main__.py generates allreduce_2nodes_128K_2M.json, so that plan was never loaded. Fixed to allreduce_2nodes_128K_2M.json.
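
For reference, the relocated import looks roughly like this; the parser setup shown is a placeholder, and the real argument definitions in __main__.py are elided:

```python
# Module-level code no longer imports argparse.

if __name__ == "__main__":
    import argparse  # deferred: only needed when the module runs as a script

    parser = argparse.ArgumentParser()
    # ... real argument definitions elided ...
    args = parser.parse_args()
```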

Minor observations (no change needed, same pattern as allreduce_2nodes.py):

  • The loop variable i is reused inside the list comprehension; this is valid Python 3 (comprehensions have their own scope, as the demo after this list shows) and matches the existing style.
  • The algorithm is general for any number of nodes but is currently only registered for 4 nodes (32 GPUs × 8 per node, 1K–8M). New size ranges can be added by registering additional AlgoSpec entries.
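
The comprehension-scope point can be verified directly; in Python 3 the comprehension body runs in its own scope, so the inner i never leaks:

```python
for i in range(3):
    squares = [i * i for i in range(5)]  # inner i shadows the loop variable
    assert i in (0, 1, 2)                # outer i is untouched by the comprehension

print(squares)  # [0, 1, 4, 9, 16]
```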

Copilot AI requested a review from Binyang2014 May 3, 2026 22:52
Binyang2014 requested a review from a team May 4, 2026 20:35
```python
)
)

scratch_buffer_size = 2 * total_gpus + packets_per_gpu * num_nodes
```
Contributor


Please add a comment here explaining how the scratch buffer size is decided.
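
One possible comment, based on the region breakdown in the review summary above; the breakdown itself is an assumption, not confirmed by the PR:

```python
# Scratch slots per GPU, split into three regions (assumed breakdown):
#   total_gpus                  -> intra-node reduce-scatter landing zone
#   total_gpus                  -> inter-node exchange of node-level partials
#   packets_per_gpu * num_nodes -> broadcast of fully reduced chunks
scratch_buffer_size = 2 * total_gpus + packets_per_gpu * num_nodes
```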
