Multinode NVL/NVLS supports #798

Draft
Binyang2014 wants to merge 25 commits into main from binyli/mnnvl

Conversation

@Binyang2014
Contributor

No description provided.

Binyang2014 and others added 25 commits April 27, 2026 20:36
Bump MAX_NRANKS_PER_NODE from 8 to 72 to cover Multi-Node NVLink (MNNVL)
domains up to GB200 NVL72, and bump NUM_SEMAPHORES from 64 to 512 to
accommodate semaphore indexing that grows as O(nRanksPerNode).

Convert allreduce_rsag_zero_copy from a compile-time-templated kernel
({4,8} ranks) to a runtime nRanksPerNode kernel; fuse load+reduce inside
the peer loop to avoid O(NPeers) register pressure that would otherwise
spill at NVL72 scale.

Bump AllreduceAllpairPacket::maxBlockNum_ from 28 to 72 so the adapter
can launch >= nPeers blocks at MNNVL scale.

Fix a shared-memory channel-cache bug across five kernels:
nvls_zero_copy, nvls_warp_pipeline, packet, allreduce_fullmesh, and
allgather_fullmesh. The original 'if (lid < nPeers) channels[lid] = ...'
load only populated the first WARP_SIZE entries, but threads from
multiple warps later read channels[threadIdx.x] up to nPeers-1. Replace
with a per-warp strided loop so every warp loads all entries before
__syncwarp(); the same-value cross-warp writes are benign.
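The indexing fix can be modeled on the host. This is an illustrative sketch (the function names are assumptions, not the actual MSCCL++ kernel code): the original single-conditional load populates at most the first WARP_SIZE entries of the shared `channels[]` cache, while the per-warp strided loop has every warp walk the full range, so every entry up to `nPeers - 1` is written before `__syncwarp()`.

```cpp
#include <cassert>
#include <vector>

constexpr int WARP_SIZE = 32;

// Entries populated by the original load: if (lid < nPeers) channels[lid] = ...;
// Lane IDs only span [0, WARP_SIZE), so entries >= WARP_SIZE are never written.
std::vector<bool> originalLoad(int nPeers) {
  std::vector<bool> loaded(nPeers, false);
  for (int lid = 0; lid < WARP_SIZE && lid < nPeers; ++lid) loaded[lid] = true;
  return loaded;
}

// Entries populated by the per-warp strided loop:
//   for (int i = lid; i < nPeers; i += WARP_SIZE) channels[i] = ...;
// Every warp runs the whole loop; the duplicate same-value writes across
// warps are benign, as the commit notes.
std::vector<bool> stridedLoad(int nPeers) {
  std::vector<bool> loaded(nPeers, false);
  for (int lid = 0; lid < WARP_SIZE; ++lid)
    for (int i = lid; i < nPeers; i += WARP_SIZE) loaded[i] = true;
  return loaded;
}
```

At `nPeers = 71` (an NVL72 domain), the original pattern leaves entries 32..70 stale while the strided loop covers all of them.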

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an MNNVL rank-domain override so MSCCL++ collectives can treat multi-host NVLink fabrics as a single CUDA IPC/NVLS peer group. Update packet, RSAG, and NVLS allreduce paths to use the collective domain size and teach the torch integration tuning example to select MNNVL-capable allreduce algorithms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the allpair packet algorithm to the MNNVL small-message candidate set and enable zero-copy NVLS/RSAG candidates for larger symmetric-memory allreduce benchmarks. Run the standalone tuning example with symmetric memory so RawGpuBuffer-backed tensors can use the zero-copy paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Run the tuning example with symmetric memory disabled, make allreduce tuning use the same symmetric-memory mode as execution, and narrow the MNNVL small-message candidate set to avoid slower packet/NVLS choices. Increase packet and RSAG channel parallelism so non-symmetric CUDA-IPC paths can use 112-block packet and 128-block RSAG configs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow default_allreduce_nvls_zero_copy to run when the public symmetric_memory flag is false; the algorithm already binds the concrete input and output allocations in its context. Include that fast path in MNNVL tuning and bound allpair/NVLS packet candidates to small sizes so large-message no-symmetric tuning avoids slow or unsafe packet variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Disable NVLS zero-copy when symmetric memory is not enabled, and allow the RSAG zero-copy path to participate in MNNVL tuning for non-symmetric memory. Cache RSAG zero-copy contexts by the concrete buffer pointers so CUDA graph capture does not create a new registration for every execute call, and cap requested blocks at the channel count.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace MSCCLPP_MNNVL_NRANKS_PER_NODE (which overrode TcpBootstrap and
silently changed getNranksPerNode() for every consumer) with a single
algorithm-level helper getIpcDomainNranks(comm) backed by a new
MSCCLPP_IPC_DOMAIN_NRANKS env. The neutral IPC name covers both NVLink/
MNNVL on NV and XGMI on AMD. Bootstrap is unchanged and continues to
report physical-host detection.

Collapse the two getCollectiveDomainNranksPerNode overloads into one
canonical helper and route all six allreduce algos (packet,
allpair_packet, nvls_packet, nvls_zero_copy, rsag, rsag_zero_copy)
through it. Update the standalone tuning example to use the new env
name; drop the undeclared MSCCLPP_ENABLE_MNNVL gate; fix
multi_host_mnnvl detection now that nranks_per_node is no longer
overridden by the bootstrap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The AlgorithmCtx field and the kernel/host parameters that hold the
collective's IPC peer-group size were named nRanksPerNode, which is
misleading on Multi-Node NVLink (where the value spans multiple hosts)
and on AMD (where the relevant fabric is XGMI, not NVLink). Rename to
ipcDomainNranks throughout the collective algorithms to match the
neutral naming introduced for the env helper.

Scope intentionally limited to src/ext/collectives/. The following are
left untouched on purpose:
  - Bootstrap::getNranksPerNode() — physical-host detection, semantics
    unchanged.
  - Algorithm::Constraint::nRanksPerNode (public API in
    include/mscclpp/algorithm.hpp) and the DSL plan config in
    algorithm_collection_builder.cc — these describe a plan's required
    physical topology.
  - NCCL adapter (src/ext/nccl/) — preserves NCCL ABI compatibility.
  - MAX_NRANKS_PER_NODE — sizing constant for shared-memory arrays.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Recovers the per-thread int4 register array + #pragma unroll for the
{4, 8} rank cases. All NPeers remote reads are issued in parallel so
their latency overlaps instead of being serialized by the runtime
fused load+reduce loop. The runtime-domain (NVL72) fallback is
removed; the algo now returns cudaErrorInvalidValue for unsupported
ipcDomainNranks, and rsag_zero_copy is dropped from the MNNVL
candidate list in the tuning example.
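The latency-overlap argument can be sketched as follows; this is a host-side illustration under assumed names, not the actual kernel. With the rank count as a template parameter, the unrolled first loop issues every peer load into a register array before the reduction consumes any of them, whereas a fused runtime load+reduce loop would serialize each load behind the previous add.

```cpp
#include <array>
#include <cassert>

// Illustrative sketch of the templated peer loop: issue all NRanks-1 remote
// loads up front (overlapping their latency), then reduce from registers.
template <int NRanks>
int loadThenReduce(const std::array<int, NRanks>& peerVals) {
  std::array<int, NRanks - 1> regs;  // per-thread register array in the kernel
#pragma unroll
  for (int i = 0; i < NRanks - 1; ++i) regs[i] = peerVals[i + 1];  // all loads issued
  int acc = peerVals[0];
#pragma unroll
  for (int i = 0; i < NRanks - 1; ++i) acc += regs[i];  // reduce afterwards
  return acc;
}
```

Because `NRanks` is fixed at {4, 8}, the register array has a compile-time bound and the unroll does not spill, unlike the runtime-domain variant at NVL72 scale.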

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Commit 533f329 dropped the static tag counter from
generateAllreduceContextKey, causing every non-symmetric call to
return the same key (zero) and reuse a stale context. Restore the
pre-MNNVL behavior of returning a unique key per non-symmetric call
so the context cache rebuilds when buffers change.
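The key-generation behavior described above can be modeled as below; the function name, hash, and signature are illustrative assumptions, not the real `generateAllreduceContextKey`. The essential contract is that symmetric-memory callers get a stable, cacheable key while non-symmetric callers get a unique key per call, so the context cache rebuilds when buffers change instead of reusing a stale registration.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the restored pre-MNNVL keying behavior (names are assumptions).
uint64_t generateContextKey(const void* in, const void* out, bool symmetric) {
  if (symmetric) {
    // Stable key: the same buffer pair maps to the same cached context.
    return reinterpret_cast<uintptr_t>(in) ^ (reinterpret_cast<uintptr_t>(out) << 1);
  }
  // Non-symmetric: a static tag makes every call unique, forcing a rebuild.
  // Dropping this counter made every non-symmetric call return zero and
  // reuse a stale context, which is the regression being fixed.
  static uint64_t tag = 0;
  return ++tag;
}
```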

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clear recycled TokenPool entries before handing them out so device-to-device semaphores start from a clean counter value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two follow-ups to commit 7bc5e04:
  * Rename mscclpp::memset to mscclpp::gpuMemset for symmetry with
    gpuMemcpy / gpuMemcpyAsync, and avoid shadowing std::memset for
    callers that pull the namespace in. Also add the missing doc
    comment.
  * Move the per-slot zeroing from getToken() into the deleter so the
    cost is paid on release rather than acquire. This is safe because
    gpuCallocPhysical already zeros the underlying buffer at TokenPool
    construction, so first-time tokens are clean and recycled tokens
    are scrubbed on release.
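The scrub-on-release pattern can be sketched with a custom deleter; this is a simplified host-side stand-in (the real pool zeroes device memory with `gpuMemset` and hands out semaphore token slots from a `gpuCallocPhysical` buffer).

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Illustrative token pool: slots are zeroed once at construction, so
// first-time tokens are clean; recycled slots are scrubbed in the deleter,
// paying the zeroing cost on release rather than on acquire.
struct TokenPool {
  std::vector<int> slots = std::vector<int>(4, 0);   // zeroed up front
  std::vector<bool> free_ = std::vector<bool>(4, true);

  std::shared_ptr<int> getToken() {
    for (size_t i = 0; i < slots.size(); ++i) {
      if (!free_[i]) continue;
      free_[i] = false;
      return std::shared_ptr<int>(&slots[i], [this, i](int* p) {
        *p = 0;           // scrub the recycled slot on release
        free_[i] = true;  // return it to the pool
      });
    }
    return nullptr;  // pool exhausted
  }
};
```

A token that was dirtied and released comes back clean on the next acquire, which is the clean-counter guarantee the device-to-device semaphores rely on.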

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both default_allreduce_nvls_warp_pipeline and default_allreduce_nvls_block_pipeline
were only partially MNNVL-aware: their kernels had been updated to use
ipcDomainNranks (with shared-memory channel arrays sized for the global
NVLink-domain bound), but the host-side context init still hard-coded
ctx->ipcDomainNranks = bootstrap->getNranksPerNode(). On a fully populated MNNVL
fabric (e.g. NVL72 where world == ipcDomainNranks but the per-physical-host
nranksPerNode is much smaller), this mismatched the multicast group span and
produced wrong/missing data plus out-of-bounds scratch indexing.

Changes:
- Rename MAX_NRANKS_PER_NODE -> MAX_IPC_DOMAIN_NRANKS to match the rest of the
  IPC-domain naming (getIpcDomainNranks, ipcDomainNranks,
  MSCCLPP_IPC_DOMAIN_NRANKS env var). Pure rename, no semantic change.
- Add validateIpcDomainSpansWorld(comm, algName) helper in collective_utils
  that wraps getIpcDomainNranks() and asserts the IPC-domain == whole-comm
  invariant required by NVLS algorithms (worldSize == ipcDomainNranks,
  rank < ipcDomainNranks, ipcDomainNranks in [2, MAX_IPC_DOMAIN_NRANKS]),
  throwing Error(InvalidUsage) on violation and returning the validated value.
- nvls_zero_copy / nvls_block_pipeline / nvls_warp_pipeline initialize() each
  now call the helper instead of repeating the same ~20-line check inline.
- initAllreduceContext() in both pipelines now uses getIpcDomainNranks(comm)
  instead of bootstrap->getNranksPerNode().
- Per-peer base channel allocation (nBaseChannels_) is sized in initialize() as
  max(64, 4*ipc) for block pipeline and max(64, 8*ipc) for warp pipeline so
  the kernel's per-block channel addressing remains in-bounds at NVL72 scale.
- Block pipeline initialize() also asserts 6*ipcDomainNranks <= NUM_SEMAPHORES.
- allreduceKernelFunc() in both pipelines now validates launch shape and the
  user-supplied scratch buffer size before launching, returning
  CommInvalidArgument with a clear WARN on mismatch:
  - Block: nBlocks must equal 5*ipcDomainNranks (structurally required by the
    kernel's three-phase block partition), nThreads == 1024, inputSize aligned
    to (ipc * 16) bytes, scratchSizePerBlock >= unitSize.
  - Warp: nBlocks >= NUM_NVLS_CONNECTION and a multiple of it (kernel does
    nBlocks / NUM_NVLS_CONNECTION partitioning of the multicast handles),
    2*nBlocks <= nBaseChannels_, nThreads == 1024 (32 warps hard-coded in the
    bar.sync member counts), inputSize divisible by ipcDomainNranks,
    scratchSizePerBlock >= copyPerIter.
- Default nBlocks for warp pipeline is rounded up to a multiple of
  NUM_NVLS_CONNECTION so the structural constraint holds for any
  ipcDomainNranks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting

Use mpi4py for bootstrap and local-rank discovery; drop the torchrun /
gloo / manual MSCCLPP_MASTER_ADDR paths and the netifaces dependency.
Add MNNVL/multi-node algorithm selection (rsag, rsag_zero_copy,
nvls_zero_copy) and route barrier / timing-sync allreduces through the
configured symmetric_memory flag so they work across hosts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The example is now MNNVL-only: a run is either single-host (everything
fits in one node) or multi-host MNNVL (one cross-host NVLink domain).
Plain multi-node-without-MNNVL had its own algorithm branch that this
example will never exercise, so remove the multi_node flag and the
intermediate mnnvl_domain variable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Drop the multi_host_mnnvl-specific rsag fallback in _default_ar_config;
  fall through to default_allreduce_packet when NVLS is unavailable.
- Add SYMMETRIC_MEMORY env var so the tuning sweep can include the
  zero-copy NVLS / RSAG candidates without editing the source.
- Make _algo() raise on miss (direct dict lookup) and drop the
  defensive 'if a:' guards in _ar_candidates / _ag_candidates /
  _default_ar_config; merge existence checks into the platform
  conditions (self._nvls, self.symmetric_memory).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Collapse the duplicated 3-line warp-strided-load comment in 5 kernels
  (allgather_fullmesh, allreduce_fullmesh, allreduce_packet,
  allreduce_nvls_zero_copy, allreduce_nvls_warp_pipeline) into a single
  one-line 'Peer count may exceed WARP_SIZE on MNNVL.' note.
- Drop the algName parameter from validateIpcDomainSpansWorld; switch
  its 3 throws to use the THROW logger macro (LogSubsys::ALGO), which
  already captures file/line/function. Update the 3 callsites
  (nvls_block_pipeline, nvls_warp_pipeline, nvls_zero_copy) and trim the
  Doxygen comment accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
getIpcDomainNranks now performs the range / world-size / rank checks
itself and throws on violation, so the separate
validateIpcDomainSpansWorld helper is unnecessary. Update the 3 NVLS
callsites (block_pipeline, warp_pipeline, nvls_zero_copy) to call
getIpcDomainNranks directly. The non-NVLS callers also pick up the
strict validation, which is fine because they are only invoked in
single-host or multi-host MNNVL scenarios where worldSize ==
ipcDomainNranks (the NCCL adapter's multi-node path returns nullptr,
falling back to NCCL/RCCL).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…param

- Restore the original two-line note about the templated peer-loop
  unrolling instead of the multi-paragraph rationale block.
- Rename the kernel template parameter from NRanksPerNode to NRanks.
  The IPC domain can span multiple physical hosts under MNNVL, so the
  'PerNode' suffix is misleading; NRanks matches the runtime
  ipcDomainNranks parameter that drives template dispatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 128-block default fires only when the caller passes nBlocks=0
(i.e. no tuning). Tuning explicitly drives nBlocks via the adapter, so
the historical default of 64 is fine. Keep nChannelsPerConnection_=128
so the tuner can still request up to 128 blocks for MNNVL configs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The hard-coded 72 was off by one from what the comment claims is the
minimum (MAX_IPC_DOMAIN_NRANKS - 1 = 71). Express the value via the
constant so the relationship is self-documenting and any future change
to MAX_IPC_DOMAIN_NRANKS propagates automatically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
allreduce_nvls_block_pipeline.cu and allreduce_nvls_warp_pipeline.cu
were carrying ~45 lines of per-call invariant-checking added during the
MNNVL work. Restore main's simple defaulting pattern (just `if
(==0) set defaults`); incorrect inputs will manifest as CUDA errors via
the existing error-handling path. Also drop the unreachable
`6 * ipcDomainNranks > NUM_SEMAPHORES` throw in the block_pipeline
initialize (max ipcDomainNranks=72, NUM_SEMAPHORES=512), the now-unused
`<mscclpp/errors.hpp>` include, and trim the verbose comments around
`nBaseChannels_` sizing in both files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The non-symmetric rsag_zero_copy path uses an incrementing tag in its
context key, so cross-rank memory registration handshakes happen on
every call rather than being cached. At single-host x 8 GPUs and
sizes >= 512 KB this becomes the only candidate (since nvls_zero_copy
is filtered out without symmetric memory) and degrades into an apparent
hang. Defaulting SYMMETRIC_MEMORY=1 lets a plain `mpirun ...`
invocation work out of the box; users can still override with
`SYMMETRIC_MEMORY=0` to exercise the non-symmetric path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
multimem.ld_reduce on FP8 inputs accumulates in FP32 by default. The
ISA also exposes an .acc::f16 variant that keeps the reduction in
FP16, which is faster but lower precision. Plumb AccumT through:

- include/mscclpp/switch_channel_device.hpp:
  Extend SwitchChannelDeviceHandle::multimemLoadReduce with an optional
  AccumT template parameter. When VectorType is one of the FP8 vector
  types (f8_e4m3x{4,8,16} / f8_e5m2x{4,8,16}) and AccumT is __half,
  emit the .acc::f16 form of the instruction; otherwise unchanged.

- src/ext/collectives/include/allreduce/common.hpp:
  Make handleMultiLoadReduceStore template on AccumT and forward it to
  multimemLoadReduce<vectorType, AccumT>(...).

- src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:
  Template allreduceNvls and NvlsAdapter on AccumT and forward to
  handleMultiLoadReduceStore<T, AccumT>; the existing dispatch<>
  machinery already plumbs AccumT through from the algorithm context.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>