GPU Requirements

Memory by model and mode

Model	Mode	Min VRAM per GPU	Recommended
π₀.₅ (1.3B)	LoRA (`deepspeed`)	12 GB	RTX 3090 / A10G / T4
π₀.₅ (1.3B)	Full (`fsdp`)	40 GB	4×A100 40GB
π₀ (3B)	LoRA (`deepspeed`)	24 GB	RTX 4090 / A100 40GB
π₀ (3B)	Full (`fsdp`)	80 GB	4×A100 80GB / 4×H100

Apply these fixes in order until training fits:

Reduce batch_size to 1 in your training config
Increase grad_accum_steps to maintain the same effective batch size
Enable gradient_checkpointing: true in the policy section (~20% slower, saves ~30% VRAM)
Reduce lora.r from 16 to 8 — halves LoRA parameter count
Switch to QLoRA — set lora.use_qlora: true (4-bit base model, fits π₀.₅ in 8 GB)
Full fine-tuning only — enable fsdp_activation_checkpointing: true in configs/fsdp_config.yaml