feat: support ReduceScatter with OpenMPI backend implementation #12

Merged: Ziminli merged 3 commits into InfiniTensor:master from halfman510:feat/support-reducescatter on May 19, 2026

Conversation

@halfman510
Collaborator

Summary

This PR introduces an OpenMPI-based implementation of ReduceScatter, along with a complete example program for functionality verification and basic performance evaluation.

Changes

  • OpenMPI-based ReduceScatter Implementation
    • add the basic OpenMPI implementation for infiniReduceScatter(), including:
      • the core interface src/base/reduce_scatter.h;
      • the OpenMPI backend implementation in src/ompi/impl/reduce_scatter.h;
      • the public API declaration in include/infiniccl.h.
    • add an example program examples/reduce_scatter.cc similar to examples/all_reduce.cc for correctness verification and simple performance testing.
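For readers unfamiliar with the collective, the result that a correctness check would compare against can be sketched as a serial reference. This is a minimal illustration, not code from this PR: the function name `ReferenceReduceScatterSum` is hypothetical, a sum reduction is assumed, and every rank is assumed to contribute `world_size * recv_count` elements (the equal-block layout that `MPI_Reduce_scatter_block` uses).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical serial reference for a block ReduceScatter with a sum reduction.
// send_bufs[r] holds rank r's input of world_size * recv_count elements.
// The returned out[r] holds the recv_count elements that rank r would receive:
// the element-wise sum, over all ranks, of block r of each input buffer.
std::vector<std::vector<float>> ReferenceReduceScatterSum(
    const std::vector<std::vector<float>>& send_bufs, std::size_t recv_count) {
  const std::size_t world_size = send_bufs.size();
  for (const auto& buf : send_bufs) {
    if (buf.size() != world_size * recv_count) {
      throw std::invalid_argument("each send buffer needs world_size * recv_count elements");
    }
  }

  std::vector<std::vector<float>> out(world_size,
                                      std::vector<float>(recv_count, 0.0f));
  for (std::size_t rank = 0; rank < world_size; ++rank) {
    for (std::size_t i = 0; i < recv_count; ++i) {
      // Accumulate block `rank` of every contributing rank's send buffer.
      for (std::size_t src = 0; src < world_size; ++src) {
        out[rank][i] += send_bufs[src][rank * recv_count + i];
      }
    }
  }
  return out;
}
```

With two ranks sending `{1, 2, 3, 4}` and `{10, 20, 30, 40}` and `recv_count = 2`, rank 0 receives the sum of the first blocks and rank 1 the sum of the second blocks.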

Known Issues & Future Work

  • The current OpenMPI ReduceScatter implementation uses blocking MPI_Reduce_scatter_block, which prevents overlap between communication and computation. Future work may introduce non-blocking collectives (MPI_Ireduce_scatter_block) and stream-aware asynchronous execution to improve concurrency and performance.
  • The current implementation allocates temporary host staging buffers using malloc/free on every invocation. This may introduce noticeable overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
  • On the heterogeneous UCX + InfiniBand cluster used in testing, large ReduceScatter messages (e.g., 1 << 20 elements) may fail with mlx5 RC RDMA_READ errors due to limitations in the UCX rendezvous RDMA_READ path. The workaround is to set UCX_RNDV_SCHEME=put_zcopy, which forces a put-based transfer protocol. Without this setting, large-message ReduceScatter execution is unstable on some NIC configurations.
  • Averaging (kAvg) is performed via a CPU-side loop after the MPI call. While functionally correct, this is not optimal for large recv_count. Future work may move scaling into the MPI operation (where supported) or use a more efficient vectorized/device-side post-processing step.
  • recv_count is cast to int for MPI (with a safety check). Extremely large messages exceeding INT_MAX elements are rejected. This is acceptable for current use cases but may need MPI_Count support in future MPI-4+ integrations for very large tensors.
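The last two items, the `int` count check and the CPU-side averaging, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in `src/ompi/impl/reduce_scatter.h`: the helper names `TryToMpiCount` and `AverageInPlace` are hypothetical, the element type is assumed to be `float`, and the MPI call itself is omitted so that the snippet stands alone.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: converts an element count to the `int` that classic
// MPI collectives expect. Returns false (leaving *out untouched) when the
// count exceeds INT_MAX, so the caller can reject the request.
bool TryToMpiCount(std::size_t count, int* out) {
  if (count > static_cast<std::size_t>(INT_MAX)) {
    return false;
  }
  *out = static_cast<int>(count);
  return true;
}

// Hypothetical helper: turns a summed ReduceScatter result into an average
// by scaling each received element by 1 / world_size on the CPU.
void AverageInPlace(float* data, std::size_t count, int world_size) {
  const float scale = 1.0f / static_cast<float>(world_size);
  for (std::size_t i = 0; i < count; ++i) {
    data[i] *= scale;
  }
}
```

In such a design the check would run before the collective and the scaling loop after it, which is why the cost of the loop grows linearly with `recv_count`.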

Logs & Screenshots

@Ziminli Ziminli changed the title Feat: support ReduceScatter with OpenMPI backend implementationFeat/support reducescatter feat: support ReduceScatter with OpenMPI backend implementation May 18, 2026
Comment thread examples/reduce_scatter.cc Outdated
Modified file:
- `include/comm.h`

Added files:
- `src/base/reduce_scatter.h`
- `src/ompi/impl/reduce_scatter.h`
- `examples/reduce_scatter.cc`
@halfman510 halfman510 force-pushed the feat/support-reducescatter branch from a1e30e7 to af38416 Compare May 19, 2026 06:35
@Ziminli Ziminli merged commit 75c184e into InfiniTensor:master May 19, 2026