Alps Extended Images

Container images that extend NVIDIA NGC base images with a fully optimized HPC networking stack tailored for the Alps supercomputer at CSCS. The images replace the bundled HPC-X components in NGC containers with libraries compiled specifically for the Slingshot CXI interconnect, enabling efficient GPU-accelerated collective communication across the Alps fabric.

Image pipeline managed via: https://cicd-ext-mw.cscs.ch

Overview

NVIDIA NGC images ship with generic HPC libraries that are not optimized for the Slingshot network fabric used on Alps. This project rebuilds the full HPC networking stack — libfabric, NCCL, NVSHMEM, UCX, UCC, OpenMPI, and their transitive dependencies — against the CXI provider and installs the result on top of each supported NGC base image.
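A minimal sketch of the resulting build pattern, assuming a parameterised Dockerfile (the BASE_IMAGE build argument and the Dockerfile location are illustrative, not the repository's actual layout):

```bash
# Extend an NGC base image with the rebuilt Slingshot/CXI networking stack.
# Conceptually, the Dockerfile copies common/install-alps-hpc-stack.sh into the
# image and runs it to replace the bundled HPC-X components.
docker build \
  --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:26.02-py3 \
  -t pytorch-26.02-py3-alps4 \
  .
```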

The resulting images are validated on multi-node Slurm allocations (clariden-gh200) before being promoted to stable registries.

Image Variants

NGC Base Images

Each variant corresponds to an NGC container extended with the Alps HPC stack:

| Variant | NGC Base | Use Case |
| --- | --- | --- |
| pytorch-25.12-py3-alps4 | nvcr.io/nvidia/pytorch:25.12-py3 | GPU-accelerated PyTorch workloads |
| pytorch-26.01-py3-alps4 | nvcr.io/nvidia/pytorch:26.01-py3 | GPU-accelerated PyTorch workloads |
| pytorch-26.02-py3-alps4 | nvcr.io/nvidia/pytorch:26.02-py3 | GPU-accelerated PyTorch workloads |
| nemo-25.11.01-alps4 | nvcr.io/nvidia/nemo:25.11.01 | Speech & language model training |
| nemo-26.02-alps4 | nvcr.io/nvidia/nemo:26.02 | Speech & language model training |
| physicsnemo-25.11-alps4 | nvcr.io/nvidia/physicsnemo/physicsnemo:25.11 | Physics-informed neural networks |

Application Images

Application images are built on top of the NGC base images and include additional software for specific workloads:

| Image | Base | Description |
| --- | --- | --- |
| apertus-1p5-alps4 | pytorch-26.02-py3 | Megatron-LM distributed LLM pretraining |
| apertus-2-alps4 | pytorch-26.02-py3 | Multi-model ML benchmark suite (pplx-garden, DeepEP, quack-kernels) |

HPC Stack Components

The common/install-alps-hpc-stack.sh script builds and installs the following libraries:

| Component | Version | Purpose |
| --- | --- | --- |
| libfabric (CXI provider) | 2.5.0 | High-speed network fabric abstraction for Slingshot |
| NCCL | 2.28.7-1 | NVIDIA collective communications (allreduce, alltoall, …) |
| aws-ofi-nccl | 1.17.3 | Routes NCCL traffic over libfabric/OFI |
| NVSHMEM | 3.6.5-0 | GPU symmetric heap memory for peer-to-peer transfers |
| UCX | 1.20.0 | Unified Communication X transport layer |
| UCC | 1.7.0 | Unified Collective Communications abstraction |
| OpenMPI | 5.0.10 | MPI implementation linked against OFI and UCX |
| GDRCopy | 2.5.1 | GPU Direct RDMA copy utilities |
| XPMEM | | Cross-process memory regions for intra-node GPU sharing |
| NCCL Tests | 2.18.2 | Collective benchmark suite |
| OSU Micro-benchmarks | 7.5.2 | Point-to-point latency and bandwidth measurements |

All components are compiled with CUDA support (auto-detected) and architecture-specific flags for NVIDIA Hopper (SM90/SM90a).
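A hedged sketch of the flavour of flags this implies; the authoritative build commands live in common/install-alps-hpc-stack.sh and may differ:

```bash
# libfabric: enable the Slingshot CXI provider and CUDA support
# (install prefix and CUDA path are illustrative)
./configure --prefix=/usr/local --enable-cxi --with-cuda=/usr/local/cuda

# NCCL: restrict generated device code to Hopper (SM90/SM90a)
make -j"$(nproc)" src.build \
  NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a"
```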

Patches for upstream issues in libfabric, NCCL, and aws-ofi-nccl are maintained under patches/.
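Illustrative only: a patch kept under patches/ would typically be applied to the unpacked source tree before the component is configured and built (the patch file name below is hypothetical):

```bash
patch -p1 -d libfabric-2.5.0 < patches/libfabric-cxi-fix.patch
```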

Runtime Environment

common/alps-runtime.env configures the runtime environment for Slingshot-based collective communication:

  • NCCL: uses the AWS libfabric transport (NCCL_NET="AWS Libfabric"), protocol tuning, and 4 channels per peer
  • CXI / libfabric: provider selection, memory registration caching, rendezvous and RX match-mode settings
  • NVSHMEM: libfabric remote transport over the Cassini provider, CUDA VMM disabled
  • OpenMPI / PMIX: security modules, byte transfer layer restricted to supported backends
  • CUDA: JIT cache disabled for shared-filesystem compatibility
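A hedged sketch of the style of settings this implies; the exact variable names and values are defined in common/alps-runtime.env and may differ:

```bash
export NCCL_NET="AWS Libfabric"            # route NCCL through the aws-ofi-nccl plugin
export FI_PROVIDER=cxi                     # select the Slingshot CXI libfabric provider
export FI_MR_CACHE_MONITOR=userfaultfd     # memory-registration cache monitoring (illustrative value)
export FI_CXI_RX_MATCH_MODE=software       # RX match mode (illustrative value)
export NVSHMEM_REMOTE_TRANSPORT=libfabric  # NVSHMEM remote transport over libfabric
export NVSHMEM_LIBFABRIC_PROVIDER=cxi      # ...using the Cassini (CXI) provider
export NVSHMEM_DISABLE_CUDA_VMM=1          # CUDA VMM disabled, as noted above
export CUDA_CACHE_DISABLE=1                # disable the CUDA JIT cache for shared filesystems
```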

CI/CD Pipeline

The GitLab CI pipeline (ci-pipelines/build-alps-extended-images.yaml) runs five stages:

  1. build-base — builds NGC+HPC extended base images; uses content hashing to skip unchanged variants
  2. test-base — validates base images on 2–4 node Slurm allocations (see the sketch after this list):
    • environment variable checks (FI_PROVIDER, NCCL settings)
    • collective benchmarks (NCCL alltoall, NVSHMEM latency, OSU bandwidth)
    • hardware verification via the vetnode framework
  3. build-apps — builds application images on top of promoted base images
  4. test-apps — runs end-to-end workload tests:
    • apertus-1p5: Megatron pretraining (2 nodes, 8 GPUs)
    • apertus-2/pplx-garden: perplexity garden benchmarks (2 nodes, 2 GPUs)
    • apertus-2/DeepEP: DeepEP benchmarks (1 node, 1 GPU)
  5. publish — promotes all tested images to stable registries; overwrites are blocked on existing stable tags
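As an illustration of the test-base checks, the collective benchmarks amount to running the benchmark binaries shipped in the image across a small Slurm allocation; the commands below are a sketch, not the pipeline's actual job definitions:

```bash
# Sanity-check the runtime environment, then run the NCCL Tests alltoall
# benchmark across 2 nodes with 4 GPUs each (-b/-e: message size range,
# -f: size multiplier, -g: GPUs per task).
echo "FI_PROVIDER=${FI_PROVIDER:?FI_PROVIDER must be set}"
srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 \
  alltoall_perf -b 8 -e 4G -f 2 -g 1
```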

Image tagging strategy: each image name encodes a SHA256 hash of its source files, allowing the pipeline to detect unchanged inputs and skip unnecessary rebuilds.
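A hedged illustration of the idea (the real pipeline's hashing logic and file set may differ): derive a short identifier from a SHA-256 digest of the image's inputs, so identical inputs always map to an already-published name that can be reused instead of rebuilt.

```bash
# Hash the source files that define the image; the Dockerfile path is hypothetical.
TAG="$(cat docker/Dockerfile common/install-alps-hpc-stack.sh common/alps-runtime.env \
       | sha256sum | cut -c1-12)"
echo "pytorch-26.02-py3-alps4:${TAG}"
```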

Acknowledgements

Alps extended base images have been developed in collaboration with the Swiss AI engineers. Special thanks to @EduardDurech for the many contributions ranging from discovering bottlenecks and major bugs to patching underlying libraries.
