ZenFlow Llama-2 Fine-Tuning Example

This project demonstrates how to fine-tune a Llama-2 model using DeepSpeed with ZenFlow, a stall-free offloading engine for large-scale model training.

Quick Start

  1. Install dependencies
pip install -r requirements.txt
  2. Configure training

Edit zf_config.json to enable ZenFlow:

"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "zenflow": {
        "topk_ratio": 0.1,
        "update_interval": 4,
        "full_warm_up_rounds": 0,
        "overlap_step": true
    }
}
  3. Run fine-tuning
bash finetune_llama.sh

This runs Llama-2 fine-tuning on Alpaca-52K using DeepSpeed with ZenFlow and saves checkpoints to ./alpaca_output.
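
The configuration fragment from step 2 can also be generated programmatically, which may help when sweeping values such as topk_ratio or update_interval. The following is a minimal Python sketch and not part of this example's scripts. The key names and values are copied from this README; `README_ZENFLOW_KEYS`, `check_zenflow`, and the bound on topk_ratio are hypothetical helpers assumed for illustration, not rules enforced by DeepSpeed. A complete zf_config.json needs further top-level keys (batch size, precision, and so on) that are omitted here.

```python
import json

# Key names and values copied from the fragment in this README.
zero_optimization = {
    "stage": 2,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "zenflow": {
        "topk_ratio": 0.1,
        "update_interval": 4,
        "full_warm_up_rounds": 0,
        "overlap_step": True,
    },
}

# The ZenFlow keys this README sets (hypothetical constant, for illustration).
README_ZENFLOW_KEYS = {
    "topk_ratio",
    "update_interval",
    "full_warm_up_rounds",
    "overlap_step",
}


def check_zenflow(zero_opt):
    """Return a list of problems found by hypothetical sanity checks."""
    zf = zero_opt.get("zenflow")
    if zf is None:
        return ["missing 'zenflow' section"]
    problems = []
    missing = README_ZENFLOW_KEYS - zf.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    ratio = zf.get("topk_ratio")
    # Assumption of this sketch: a ratio is a fraction in (0, 1].
    if isinstance(ratio, (int, float)) and not 0 < ratio <= 1:
        problems.append("topk_ratio should be in (0, 1]")
    return problems


# Serialize only the fragment; merge it into a full config before use.
fragment = json.dumps({"zero_optimization": zero_optimization}, indent=4)
print(fragment)
print(check_zenflow(zero_optimization))
```

The printed JSON matches the fragment shown in step 2 and can be merged into an existing zf_config.json by hand or with `json.load` and `json.dump`.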

Example Output

Below is a sample log showing step times and loss values. You can see a significant speedup after the first full step:

ZenFlowCPUAdam initialized with overlap step.
Step 5, Loss: 1.2599, Time: 719.58ms 
Step 6, Loss: 0.9847, Time: 702.81ms <-- gradient accumulation with overlapped update
Step 7, Loss: 0.6220, Time: 705.50ms
Step 8, Loss: 0.5173, Time: 1912.92ms <-- full optimizer step of remaining part and update parameters
Step 9, Loss: 0.4557, Time: 890.60ms
Step 10, Loss: 0.3882, Time: 740.11ms
Step 11, Loss: 0.3627, Time: 731.95ms
Step 12, Loss: 0.3341, Time: 2221.18ms
Step 13, Loss: 0.2453, Time: 1061.80ms
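
To compare the two kinds of steps numerically, the log can be parsed with a few lines of Python. This is a sketch that assumes the log format shown above. Based on "update_interval": 4 and the slow steps 8 and 12, it also assumes that full updates land on step numbers divisible by 4. `UPDATE_INTERVAL` and the other names are illustrative and do not come from the example's scripts.

```python
import re

# Sample log copied from this README (annotations are ignored by the regex).
LOG = """\
ZenFlowCPUAdam initialized with overlap step.
Step 5, Loss: 1.2599, Time: 719.58ms
Step 6, Loss: 0.9847, Time: 702.81ms <-- gradient accumulation with overlapped update
Step 7, Loss: 0.6220, Time: 705.50ms
Step 8, Loss: 0.5173, Time: 1912.92ms <-- full optimizer step of remaining part and update parameters
Step 9, Loss: 0.4557, Time: 890.60ms
Step 10, Loss: 0.3882, Time: 740.11ms
Step 11, Loss: 0.3627, Time: 731.95ms
Step 12, Loss: 0.3341, Time: 2221.18ms
Step 13, Loss: 0.2453, Time: 1061.80ms
"""

# Assumption: full updates happen when the step number is a multiple of 4.
UPDATE_INTERVAL = 4

pattern = re.compile(r"Step (\d+), Loss: ([\d.]+), Time: ([\d.]+)ms")
records = [(int(s), float(loss), float(ms)) for s, loss, ms in pattern.findall(LOG)]

full_times = [ms for step, _, ms in records if step % UPDATE_INTERVAL == 0]
accum_times = [ms for step, _, ms in records if step % UPDATE_INTERVAL != 0]

mean_full = sum(full_times) / len(full_times)
mean_accum = sum(accum_times) / len(accum_times)

print(f"accumulation steps: {mean_accum:.2f} ms")  # accumulation steps: 793.19 ms
print(f"full-update steps:  {mean_full:.2f} ms")   # full-update steps:  2067.05 ms
```

The accumulation average is pulled above 700 ms by steps 9 and 13, which directly follow a full update and are somewhat slower than the others.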

Key Insight

Steps such as 5, 6, and 7 are accumulation steps, during which ZenFlow overlaps part of the optimizer step in the background. These steps remain fast (~700 ms).

Step 8 performs the remaining part of the optimizer step and copies the updated parameters to the GPU. Full-update steps like this take about 1.9 to 2.2 s in the log above (steps 8 and 12).

Without ZenFlow, a full update would take nearly 4 seconds. ZenFlow shifts about half of this cost onto the earlier accumulation steps through asynchronous overlap.

This demonstrates how ZenFlow hides much of the CPU offload cost, enabling near stall-free training. Crucially, ZenFlow not only overlaps the CPU optimizer step but also maintains training progress on the GPU by immediately applying the updates for the most important gradients.
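
The effect on one update window can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope calculation, not a benchmark. It takes steps 5 to 8 from the sample log, rounds "nearly 4 seconds" to an assumed 4000 ms for the update step without ZenFlow, and assumes the accumulation steps would cost the same in both cases. `ASSUMED_BASELINE_UPDATE_MS` and the other names are hypothetical.

```python
# Step times (ms) for one update window, copied from the sample log (steps 5-8).
zenflow_window = [719.58, 702.81, 705.50, 1912.92]

# Assumptions of this sketch, not measurements:
# - "nearly 4 seconds" is rounded to 4000 ms for the update step without ZenFlow
# - accumulation steps cost the same with and without ZenFlow
ASSUMED_BASELINE_UPDATE_MS = 4000.0
baseline_window = zenflow_window[:-1] + [ASSUMED_BASELINE_UPDATE_MS]

zenflow_total = sum(zenflow_window)
baseline_total = sum(baseline_window)

# Time saved per window and the resulting ratio under the assumptions above.
saved_ms = baseline_total - zenflow_total
ratio = baseline_total / zenflow_total

print(f"ZenFlow window total:  {zenflow_total:.2f} ms")
print(f"Assumed baseline total: {baseline_total:.2f} ms")
print(f"Estimated saving:       {saved_ms:.2f} ms ({ratio:.2f}x)")
```

Under these assumptions the saving comes entirely from the shorter update step, which is consistent with the claim that roughly half of the update cost is hidden in earlier steps.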

Notes

  • To change model, batch size, or epochs, modify finetune_llama.sh.
  • All DeepSpeed and ZenFlow options are controlled via zf_config.json.

Citation

To cite ZenFlow, please use our arXiv report:

@misc{lan2025zenflowenablingstallfreeoffloading,
      title={ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates}, 
      author={Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},
      year={2025},
      eprint={2505.12242},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2505.12242}, 
}