**Goal:** Train a production-ready speech-to-speech AI model from scratch, independent of any external APIs.

**Budget:** $300-1,700 depending on dataset size and training iterations

**Timeline:** 2-8 weeks depending on GPU availability and dataset preparation
**Contents:**

- Quick Start (Get Training in 30 minutes)
- System Requirements
- Data Preparation
- Phase 1: Train Speech Tokenizer
- Phase 2: Train Hybrid S2S Model
- Phase 3: Add Emotional Control
- Deployment to Production
- Cost Breakdown
- Troubleshooting
## Quick Start (Get Training in 30 minutes)

```bash
# 1. Launch RunPod A100 80GB pod (Secure Cloud or Community)
# 2. Connect via SSH or Jupyter
# 3. Clone repository
cd /workspace
git clone https://github.com/devasphn/Testing-S2S.git
cd Testing-S2S
# 4. Install dependencies
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
# Install PyTorch with CUDA 12.1
pip install torch==2.3.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
# Install project dependencies
pip install -r requirements.txt
pip install -r requirements-training.txt
# 5. Download LibriSpeech dataset (100 hours)
mkdir -p /workspace/data
cd /workspace/data
wget http://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz
# Optional: Download larger datasets for better quality
# wget http://www.openslr.org/resources/12/train-clean-360.tar.gz
# wget http://www.openslr.org/resources/12/train-other-500.tar.gz
# 6. Update config with your data path
cd /workspace/Testing-S2S
nano training/configs/tokenizer_config.yaml
# Change data_dir to: /workspace/data
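# Or edit non-interactively (assumes a single `data_dir:` key in the YAML):
# sed -i 's|data_dir:.*|data_dir: "/workspace/data"|' training/configs/tokenizer_config.yaml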
# 7. Start training!
python training/train_tokenizer.py --config training/configs/tokenizer_config.yaml
# 8. Monitor training
# - Watch console output
# - Check WandB dashboard (if enabled)
# - Monitor GPU: watch -n 1 nvidia-smi
```

**Expected Output:**

```
[DEVICE] Using cuda
[DATASET] Loaded 28539 files from train-clean-100
[DATA] Train: 27112, Val: 1427
[MODEL] Parameters: 42,853,376
Epoch 1/100
============================================================
Epoch 1: 100%|███████| 1695/1695 [05:12<00:00, 5.42it/s, loss=0.3245, mel_l1=0.2891, commit=0.0354]
Validation: 100%|███████| 90/90 [00:18<00:00, 4.89it/s, val_loss=0.2987]
Epoch 1 Summary:
Train Loss: 0.3245
Val Loss: 0.2987
LR: 1.00e-04
  ✓ Saved best model (val_loss: 0.2987)
```
## System Requirements

**Minimum:**

- GPU: NVIDIA A40 (48GB VRAM) or better
- RAM: 32GB system RAM
- Storage: 200GB SSD
- Network: High-speed for dataset downloads (100GB+ datasets)
**Recommended:**

- GPU: NVIDIA A100 80GB
- RAM: 64GB system RAM
- Storage: 500GB NVMe SSD
- Network: 1 Gbps+
**Software:**

- OS: Ubuntu 22.04 LTS (or RunPod Docker image)
- Python: 3.10+
- CUDA: 12.1+
- PyTorch: 2.3.0+
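Before downloading data, it's worth confirming that the GPU, CUDA, and PyTorch versions line up. A quick check, run inside the activated venv:

```python
# Verify the training environment matches the requirements above
import torch

print(torch.__version__)                  # expect 2.3.0+
print(torch.version.cuda)                 # expect 12.1
print(torch.cuda.is_available())          # expect True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100 80GB PCIe"
```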
## Data Preparation

### LibriSpeech (English)

Datasets:

- train-clean-100: 100 hours, clean speech ($10 training cost)
- train-clean-360: 360 hours, clean speech ($40 training cost)
- train-other-500: 500 hours, diverse speakers ($60 training cost)
Download:

```bash
cd /workspace/data
# Small (100h) - Good for testing
wget http://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz
# Medium (360h) - Good quality
wget http://www.openslr.org/resources/12/train-clean-360.tar.gz
tar -xzf train-clean-360.tar.gz
# Large (500h) - Best quality
wget http://www.openslr.org/resources/12/train-other-500.tar.gz
tar -xzf train-other-500.tar.gz
```

### Mozilla Common Voice (Multilingual)

Languages: 100+ including Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam
Download:
- Go to https://commonvoice.mozilla.org/datasets
- Select languages (e.g., Hindi, English)
- Download the `.tar.gz` files
- Extract to `/workspace/data/CommonVoice/`
**Note:** Common Voice requires preprocessing (see the resampling sketch below); LibriSpeech is ready to use.
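Common Voice ships MP3 clips at varying sample rates. A minimal preprocessing sketch, assuming torchaudio with an FFmpeg backend (paths are illustrative; adjust to your extracted layout):

```python
# Resample Common Voice MP3 clips to 24 kHz mono WAV to match the tokenizer
from pathlib import Path
import torchaudio

src = Path("/workspace/data/CommonVoice/clips")
dst = Path("/workspace/data/CommonVoice/wav")
dst.mkdir(parents=True, exist_ok=True)

for mp3 in src.glob("*.mp3"):
    wav, sr = torchaudio.load(str(mp3))
    wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
    wav = torchaudio.functional.resample(wav, sr, 24000)  # match model sample rate
    torchaudio.save(str(dst / f"{mp3.stem}.wav"), wav, 24000)
```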
### Synthetic Conversations

For conversational data (needed for S2S training):

```bash
# Generate synthetic conversations using GPT-4 + TTS
python scripts/generate_synthetic_conversations.py \
--num_samples 10000 \
--output_dir /workspace/data/synthetic \
--openai_api_key YOUR_KEY
# Cost: ~$100-300 for 10K conversation pairs
```

## Phase 1: Train Speech Tokenizer

The speech tokenizer is like a "speech compression" model that converts audio into discrete tokens (similar to how JPEG compresses images).
Architecture:

```
Audio (24kHz) → Mel Spectrogram → CNN Encoder → RVQ (8 quantizers) → CNN Decoder → Mel → HiFi-GAN Vocoder → Audio
```
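The RVQ stage is what makes the tokens discrete: each quantizer encodes the residual left over by the previous one, so 8 quantizers yield 8 token streams per frame. A minimal sketch of the idea (dimensions mirror the config below; the repository's actual implementation may differ):

```python
# Residual vector quantization: nearest-code lookup on successive residuals
import torch

def rvq_encode(z, codebooks):
    """z: (batch, dim); codebooks: list of (codebook_size, dim) tensors."""
    residual, tokens = z, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (batch, codebook_size)
        idx = dists.argmin(dim=-1)          # nearest code for each item
        tokens.append(idx)
        residual = residual - cb[idx]       # next quantizer sees what's left
    return torch.stack(tokens, dim=-1)      # (batch, num_quantizers)

codebooks = [torch.randn(1024, 512) for _ in range(8)]   # codebook_size=1024, hidden_dim=512
print(rvq_encode(torch.randn(4, 512), codebooks).shape)  # torch.Size([4, 8])
```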
**1. Prepare Config:**

```yaml
# training/configs/tokenizer_config.yaml
model:
  sample_rate: 24000   # Match Luna AI
  codebook_size: 1024
  hidden_dim: 512
  num_quantizers: 8

data:
  data_dir: "/workspace/data"
  train_split: "train-clean-100"   # or train-clean-360

training:
  epochs: 100          # Increase to 200 for better quality
  batch_size: 16       # Adjust for your GPU (8 for A40, 16-32 for A100)
  learning_rate: 1e-4
```
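Before launching a long run, it can help to confirm the config parses and the data path exists (a small sketch; assumes PyYAML, with key names as in the config above):

```python
# Sanity-check the tokenizer config before training
import pathlib
import yaml

cfg = yaml.safe_load(open("training/configs/tokenizer_config.yaml"))
assert pathlib.Path(cfg["data"]["data_dir"]).exists(), "data_dir not found"
print(cfg["training"]["batch_size"], cfg["training"]["learning_rate"])
```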
**2. Start Training:**

```bash
cd /workspace/Testing-S2S
source venv/bin/activate
# Basic training
python training/train_tokenizer.py
# With custom config
python training/train_tokenizer.py --config my_config.yaml
# Resume from checkpoint
python training/train_tokenizer.py --resume checkpoints/tokenizer/tokenizer_epoch_50.pt
```
**3. Monitor Progress:**

```bash
# Terminal 1: Training logs
python training/train_tokenizer.py
# Terminal 2: GPU monitoring
watch -n 1 nvidia-smi
# Terminal 3: Tensorboard (if not using wandb)
tensorboard --logdir checkpoints/tokenizer/logs --port 6006
```
**4. Evaluate Quality:**

```bash
# Test tokenizer on sample audio
python scripts/test_tokenizer.py \
--checkpoint checkpoints/tokenizer/tokenizer_best.pt \
--input test_audio.wav \
--output reconstructed.wav
# Listen to reconstructed audio
# Good quality: Clear speech, minimal artifacts
# Bad quality: Muffled, robotic, missing details
```

| Dataset | Training Time | Cost | Quality |
|---|---|---|---|
| train-clean-100 (100h) | 8-12 hours | $10-15 | Good for testing |
| train-clean-360 (360h) | 30-40 hours | $36-48 | Production-ready |
| train-clean-100+360+500 (960h) | 80-120 hours | $95-143 | Excellent |
**Metrics to Watch** (see the check script below):
- Mel L1 Loss: Should drop below 0.15 (lower = better)
- Commitment Loss: Should stabilize around 0.01-0.05
- Validation Loss: Should decrease steadily without overfitting
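To put a number on reconstruction quality, a rough mel-space L1 check between the original and reconstructed files (illustrative; scripts/test_tokenizer.py may compute its losses differently):

```python
# Mel-space L1 distance between two audio files
import torchaudio

def mel_l1(path_a: str, path_b: str, sample_rate: int = 24000) -> float:
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
    a, sr_a = torchaudio.load(path_a)
    b, sr_b = torchaudio.load(path_b)
    a = torchaudio.functional.resample(a, sr_a, sample_rate)
    b = torchaudio.functional.resample(b, sr_b, sample_rate)
    n = min(a.shape[-1], b.shape[-1])  # align lengths before comparing
    return (mel(a[..., :n]) - mel(b[..., :n])).abs().mean().item()

print(mel_l1("test_audio.wav", "reconstructed.wav"))  # below ~0.15 is a good sign
```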
## Phase 2: Train Hybrid S2S Model

Coming in next commit: Complete S2S training pipeline
Preview:
- Uses trained tokenizer from Phase 1
- Trained on conversational data
- Architecture similar to Moshi/GLM-4-Voice
- Estimated cost: $300-500
## Phase 3: Add Emotional Control

Coming in next commit: Emotion fine-tuning pipeline
Preview:
- Fine-tune on IEMOCAP emotional speech dataset
- Add emotion embedding layer
- Support for laughter, sighs, whispers
- Estimated cost: $100-200
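As a rough illustration of what an emotion embedding layer could look like (purely hypothetical until the pipeline lands; names and shapes are assumptions):

```python
# Hypothetical sketch: condition hidden states on a discrete emotion label
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    def __init__(self, num_emotions: int = 8, dim: int = 512):
        super().__init__()
        self.emb = nn.Embedding(num_emotions, dim)

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, dim); emotion_id: (batch,)
        return hidden + self.emb(emotion_id).unsqueeze(1)  # broadcast over time

out = EmotionConditioner()(torch.randn(2, 50, 512), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 50, 512])
```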
## Deployment to Production

Once training is complete, integrate the trained models into your existing FastAPI server:

```python
# src/server.py
from src.models.speech_tokenizer_trainable import TrainableSpeechTokenizer
from src.models.hybrid_s2s import HybridS2SModel

# Load trained models (`app`, `device`, and the FastAPI imports
# come from your existing server code)
tokenizer = TrainableSpeechTokenizer(
    checkpoint_path="/workspace/checkpoints/tokenizer/tokenizer_best.pt"
).to(device)

s2s_model = HybridS2SModel(
    checkpoint_path="/workspace/checkpoints/s2s/s2s_best.pt"
).to(device)

# Use in inference
@app.websocket("/ws/stream")
async def websocket_endpoint(websocket: WebSocket):
    # ... your existing streaming logic

    # Tokenize user audio
    user_tokens = tokenizer.tokenize(user_audio)

    # Generate AI response tokens
    ai_tokens = s2s_model.generate_streaming(user_tokens)

    # Detokenize to audio
    ai_audio = tokenizer.detokenize(ai_tokens)

    # Stream back to user
    await websocket.send_bytes(ai_audio)
```
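For a quick end-to-end check of the endpoint, a minimal client sketch (assumes `pip install websockets`; URL and filenames are illustrative):

```python
# Send one audio file to /ws/stream and save the reply
import asyncio
import websockets

async def roundtrip(url: str = "ws://localhost:8000/ws/stream") -> None:
    async with websockets.connect(url) as ws:
        with open("test_audio.wav", "rb") as f:
            await ws.send(f.read())  # user audio in
        reply = await ws.recv()      # AI audio out
        with open("reply.wav", "wb") as f:
            f.write(reply)

asyncio.run(roundtrip())
```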
## Cost Breakdown

**Minimal run:**

- Tokenizer: LibriSpeech 100h, 10 hours training = $12
- S2S Model: Synthetic 5K pairs, 100 hours training = $119
- Emotion Fine-tuning: IEMOCAP, 50 hours = $60
- Experimentation: 100 hours = $119
- Total: ~$310
**Full run:**

- Tokenizer: LibriSpeech 960h, 120 hours training = $143
- S2S Model: Synthetic 20K pairs, 400 hours training = $476
- Emotion Fine-tuning: Multiple datasets, 150 hours = $179
- Experimentation: 500 hours = $595
- Total: ~$1,393
**RunPod GPU Pricing:**

- A100 80GB (Secure Cloud): $1.19/hr
- A100 80GB (Community): $0.89/hr (~25% cheaper)
- A40 48GB (Community): $0.49/hr (good for inference)
**Cost Optimization Tips:**

- Use Community Cloud (~25% cheaper, slightly less reliable)
- Use Spot Instances for training (70% cheaper, can be interrupted)
- Train tokenizer on smaller dataset first, validate, then scale up
- Use mixed precision training (fp16) to fit larger batches
## Troubleshooting

**Problem:** Model starts with random weights

**Solution:**

```bash
# Check if checkpoint exists
ls -la checkpoints/tokenizer/
# If missing, train first
python training/train_tokenizer.py
```
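To confirm a checkpoint is actually being picked up, it can help to peek inside it (key names are illustrative; they depend on how train_tokenizer.py saves state):

```python
# Inspect a saved checkpoint
import torch

ckpt = torch.load("checkpoints/tokenizer/tokenizer_best.pt", map_location="cpu")
print(list(ckpt.keys()))  # e.g. model_state_dict, optimizer_state_dict, epoch, val_loss
```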
**Problem:** GPU memory exhausted

**Solutions:**
- Reduce batch size in config:

  ```yaml
  training:
    batch_size: 8  # instead of 16
  ```
- Use gradient accumulation (see the combined loop sketch after this list):

  ```yaml
  training:
    batch_size: 8
    gradient_accumulation_steps: 2  # effective batch size = 16
  ```
- Use mixed precision:

  ```python
  # In the training script
  from torch.cuda.amp import autocast, GradScaler

  scaler = GradScaler()
  with autocast():
      outputs = model(audio)
  ```
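Putting the last two tips together, a self-contained sketch of a training step with gradient accumulation and mixed precision (dummy model and data; not the repo's actual loop):

```python
# Gradient accumulation + AMP combined (illustrative; adapt to train_tokenizer.py)
import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
loader = [(torch.randn(8, 10).cuda(), torch.randn(8, 1).cuda()) for _ in range(4)]  # stand-in batches

accum_steps = 2
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so grads average
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)  # one update per accum_steps micro-batches
        scaler.update()
        optimizer.zero_grad()
```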
**Problem:** Model not learning

**Solutions:**
- Check learning rate (try 1e-5 to 1e-3 range)
- Verify data preprocessing is correct
- Reduce model size temporarily for debugging
- Check for NaN gradients with `torch.isnan(loss).any()` (per-parameter audit below)
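If the loss looks fine but training still stalls, a quick per-parameter gradient audit can locate the problem (call it right after `loss.backward()`):

```python
# Report any parameters whose gradients contain NaN
import torch

def report_nan_grads(model: torch.nn.Module) -> None:
    for name, p in model.named_parameters():
        if p.grad is not None and torch.isnan(p.grad).any():
            print(f"NaN gradient in {name}")
```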
**Problem:** Reconstructed audio is muffled or robotic

**Solutions:**
- Train longer (more epochs)
- Increase model capacity (hidden_dim: 768 instead of 512)
- Use larger dataset (960h instead of 100h)
- Adjust commitment_weight in config (try 0.1 to 0.5; the sketch below shows where it enters the loss)
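For context on that last knob, the objective presumably combines the two Phase 1 metrics in standard VQ-VAE fashion (an assumption; check train_tokenizer.py for the exact form):

```python
def tokenizer_loss(mel_l1: float, commitment: float, commitment_weight: float = 0.25) -> float:
    """Assumed objective: mel reconstruction plus a weighted commitment term."""
    return mel_l1 + commitment_weight * commitment
```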
## Next Steps

**Immediate (This Week):**
- ✓ Set up RunPod environment
- ✓ Download LibriSpeech dataset
- ✓ Start tokenizer training
- Monitor first 10 epochs, validate quality
**Short-term (Next 2 Weeks):**
- Complete tokenizer training
- Test reconstruction quality
- Prepare conversational dataset for S2S
- Start S2S model training
**Medium-term (Next Month):**
- Train full S2S pipeline
- Add emotion control
- Deploy to production
- Collect real user feedback
**Long-term (Next 3 Months):**
- Fine-tune on Indian languages
- Add breathing sounds and advanced prosody
- Scale to multiple voices/personalities
- Optimize for <500ms latency
## Resources

**GitHub Repository:** https://github.com/devasphn/Testing-S2S

**Useful Links:**
- LibriSpeech Dataset: http://www.openslr.org/12/
- RunPod Documentation: https://docs.runpod.io/
- WandB Monitoring: https://wandb.ai/
- PyTorch Tutorials: https://pytorch.org/tutorials/
**Community:**
- Open issues on GitHub for bugs
- Join WandB workspace for training logs
- Share checkpoints with team via RunPod Network Storage
Last Updated: November 17, 2025
Version: 1.0 (Tokenizer Training Phase)
Status: ✓ Ready to start training!