SGLang backend #549

Open
pmukeshreddy wants to merge 5 commits into OpenPipe:main from pmukeshreddy:sglang-backend

Conversation


pmukeshreddy commented on Feb 4, 2026

Summary

Adds SGLang as an alternative to the vLLM inference backend, optimized for RL training workloads. SGLang's RadixAttention provides automatic prefix caching, which significantly improves performance for multi-turn agent trajectories.

Motivation

In RL training loops, many rollouts share the same system prompt and conversation prefix. SGLang's RadixAttention automatically caches these common prefixes, providing:

  • ~29% higher throughput vs vLLM (external benchmarks)
  • ~22% lower TTFT (time to first token)
  • Persistent KV cache across training steps (no restart needed)

Architecture

┌─────────────────────────────────────────────────────────┐
│           Multi-GPU Split Architecture                   │
├─────────────────────────────────────────────────────────┤
│  GPU 0: SGLang Server    │  GPU 1+: Training (Unsloth)  │
│  - RadixAttention cache  │  - PEFT/LoRA model           │
│  - OpenAI-compatible API │  - GRPO optimizer            │
│  - LoRA hot-reload       │  - Gradient computation      │
└─────────────────────────────────────────────────────────┘
        Weight Sync: Hot-reload via HTTP API (~5-10s)

Key design decision: SGLang server runs in a separate Python environment to avoid torchao version conflicts (SGLang needs torchao==0.9, Unsloth needs torchao>=0.13). Communication happens via HTTP, not Python imports.
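
As a rough illustration of that HTTP-only coupling, weight sync amounts to saving the updated LoRA adapter to disk and asking the server to reload it over HTTP. The sketch below is illustrative only: the server URL and endpoint name are assumptions, and in this PR the call is wrapped inside SGLangBackend rather than made by users directly.

import requests

SGLANG_URL = "http://localhost:30000"  # assumed server address

def hot_reload_lora(adapter_dir: str) -> None:
    """Ask the inference server to reload LoRA weights saved at adapter_dir.

    The endpoint name and payload are hypothetical placeholders for whatever
    reload route the SGLang server exposes.
    """
    resp = requests.post(
        f"{SGLANG_URL}/update_weights_from_disk",  # hypothetical endpoint
        json={"model_path": adapter_dir},
        timeout=60,
    )
    resp.raise_for_status()  # reload takes roughly 5-10s; the KV cache is preserved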

Features

  • Drop-in replacement for LocalBackend
  • Auto GPU detection: uses split mode on 2+ GPUs, falls back to single-GPU mode otherwise (see the sketch after this list)
  • LoRA hot-reload: updates weights without a server restart (preserves the RadixAttention cache)
  • Two-environment architecture: keeps SGLang's and Unsloth's conflicting torchao requirements isolated
  • Comprehensive documentation: see docs/sglang-integration.md
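
The auto GPU detection above boils down to counting visible devices and choosing a layout. A minimal sketch of that decision, assuming the split described in the architecture diagram (the function name and return format are illustrative, not the PR's actual internals):

import torch

def choose_gpu_layout() -> dict:
    """Dedicate GPU 0 to the SGLang server when 2+ GPUs are visible,
    otherwise fall back to sharing a single GPU with training."""
    n_gpus = torch.cuda.device_count()
    if n_gpus >= 2:
        return {"mode": "split", "server_gpus": [0], "train_gpus": list(range(1, n_gpus))}
    return {"mode": "single", "server_gpus": [0], "train_gpus": [0]}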

Usage

from art.sglang_backend import SGLangBackend

# Auto-detects GPUs and configures optimally
backend = SGLangBackend()
await backend.register(model)
result = await backend.train(model, trajectory_groups)
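
For context, a slightly fuller snippet showing where the model comes from. This is a sketch: it assumes the model is declared the same way as with ART's existing LocalBackend (art.TrainableModel), and the names and base model below are placeholders.

import art
from art.sglang_backend import SGLangBackend

async def main():
    model = art.TrainableModel(
        name="agent-demo",                      # placeholder name
        project="sglang-demo",                  # placeholder project
        base_model="Qwen/Qwen2.5-7B-Instruct",  # placeholder base model
    )
    backend = SGLangBackend()      # auto-detects GPUs, split mode on 2+
    await backend.register(model)  # starts or attaches to the SGLang server
    # trajectory_groups come from your rollout/reward code
    result = await backend.train(model, trajectory_groups)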

Testing

Tested end-to-end on 4x H100 80GB:

# Fresh server test
git clone -b sglang-backend https://github.com/pmukeshreddy/ART.git
cd ART

# Install Python 3.11
apt install -y software-properties-common
add-apt-repository -y ppa:deadsnakes/ppa
apt update
apt install -y python3.11 python3.11-venv python3.11-dev

# Create main venv for ART package
python3.11 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip wheel
pip install -e ".[sglang]"
deactivate

# Create separate SGLang server venv (REQUIRED - different torch/cuda versions)
python3.11 -m venv .venv-sglang-server
source .venv-sglang-server/bin/activate
pip install --upgrade pip wheel
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install "sglang[all]>=0.4.0" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
deactivate

# Run the E2E test
source .venv/bin/activate
python scripts/test_sglang_e2e.py

Test output:

[1/7] Importing modules...              ✓
[2/7] Checking SGLang server environment... ✓
[3/7] Initializing SGLangBackend...     ✓
[4/7] Registering model...              ✓
[5/7] Starting server and testing inference... ✓
[6/7] Running training step...          ✓
[7/7] Testing inference after training... ✓

ALL TESTS PASSED!
