Running Training
Prerequisites
- SCRATCH, HF_TOKEN, and DATASET_REPO_ID set in your environment (see Storage Setup and First-time Setup)
- Container image available (Docker or Singularity)
# Docker (cloud / local)
docker pull frieddeli/vlash-forge:latest
# or build locally:
docker build -t vlash-forge:latest .
# Singularity (HPC)
singularity pull vlash.sif docker://frieddeli/vlash-forge:latest
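Before launching, it can help to confirm that the required variables are actually set. The following is a minimal sketch, not part of the repository: `require_env` is a hypothetical helper name, and it only checks that each variable is non-empty.

```shell
# Sketch: fail fast when a required variable is unset or empty.
# require_env is a hypothetical helper, not part of scripts/train.sh.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable named by $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: require_env SCRATCH HF_TOKEN DATASET_REPO_ID || exit 1
```

Reporting every missing variable in one pass avoids fixing them one failed launch at a time.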
Option A — Unified launcher (Docker or Singularity)
scripts/train.sh detects whether vlash.sif is present and launches via
Singularity (HPC) or Docker (cloud/local) automatically.
export SCRATCH=/your/persistent/storage
export HF_TOKEN=hf_xxx
export DATASET_REPO_ID=your-org/your-dataset
# Single GPU — LoRA (default)
./scripts/train.sh examples/train/pi05/cloud.yaml
# Multi-GPU — LoRA
NUM_GPUS=4 ./scripts/train.sh examples/train/pi05/cloud.yaml
# Full fine-tuning (40 GB+ VRAM per GPU)
TRAIN_BACKEND=fsdp ./scripts/train.sh examples/train/pi05/cloud_full.yaml
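The runtime detection described at the start of this option can be pictured roughly as follows. This is only an illustrative sketch: `pick_runtime` is a hypothetical function, and the actual logic in scripts/train.sh may differ.

```shell
# Sketch: choose a container runtime based on whether the Singularity image exists.
# pick_runtime is hypothetical; the real scripts/train.sh may work differently.
pick_runtime() {
    sif_path="${1:-vlash.sif}"
    if [ -f "$sif_path" ]; then
        echo "singularity"
    else
        echo "docker"
    fi
}

# Example: runtime=$(pick_runtime vlash.sif)
```

Under this assumption, removing or renaming vlash.sif would make the launcher fall back to Docker.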
Smoke test before a full run
Validate the pipeline works before committing to 50k steps:
This confirms that the dataset loads, the model downloads, and a checkpoint saves, without using significant compute quota.
First-run overhead
DeepSpeed and bitsandbytes compile CUDA extensions on first use (~1–3 minutes).
The compiled extensions are cached in $SCRATCH/.cache, so later runs skip this step.
Option B — Docker Compose (cloud, multi-GPU)
For commercial cloud VMs, SSH into the instance and run directly — no scheduler needed.
The difference from HPC: on a cloud VM you are already on the GPU node, so there is no job submission step. Set your env vars, run the command, and training starts immediately.
Option C — PBS job (NSCC ASPIRE and PBS clusters)
NSCC ASPIRE uses PBSpro. Create a job script and submit with qsub:
#!/bin/bash
#PBS -l select=1:ngpus=4
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -o logs/
export SCRATCH=/scratch/users/ntu/<your-id>
export HF_TOKEN=hf_xxx
export DATASET_REPO_ID=your-org/your-dataset
export WANDB_API_KEY=your_wandb_key # optional — remove if not using W&B
cd $PBS_O_WORKDIR
./scripts/train.sh examples/train/pi05/cloud.yaml
CUDA_VISIBLE_DEVICES on PBS
PBSpro sets GPU IDs as UUIDs (e.g. GPU-50ee0fc4-...). The container entrypoint
automatically normalises these to integer indices (0,1,2,...) required by NCCL.
No manual export needed.
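To illustrate what this normalisation involves, here is one plausible approach. It is a sketch under a stated assumption, not the entrypoint's actual code: it assumes only the allocated GPUs are visible to the job, so each UUID can be replaced by its position in the list. `normalise_gpu_ids` is a hypothetical name.

```shell
# Sketch: map a comma-separated list of GPU UUIDs to 0-based indices by position.
# Assumes only the allocated GPUs are visible to the job; the real entrypoint
# may instead look up each UUID explicitly.
normalise_gpu_ids() {
    ids="$1"
    # A list with no UUID-style entries is returned unchanged.
    case "$ids" in
        *GPU-*) ;;
        *) echo "$ids"; return 0 ;;
    esac
    # Count the non-empty entries in the list.
    count=$(printf '%s\n' "$ids" | tr ',' '\n' | grep -c .)
    i=0
    out=""
    while [ "$i" -lt "$count" ]; do
        out="${out:+$out,}$i"
        i=$((i + 1))
    done
    echo "$out"
}

# Example: CUDA_VISIBLE_DEVICES=$(normalise_gpu_ids "$CUDA_VISIBLE_DEVICES")
```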
Option D — SLURM job
export SCRATCH=/scratch/users/ntu/<your-id>
export HF_TOKEN=hf_xxx
export DATASET_REPO_ID=your-org/your-dataset
sbatch scripts/train_slurm.sbatch examples/train/pi05/cloud.yaml
Edit the #SBATCH directives at the top of scripts/train_slurm.sbatch to match
your cluster's partition name, GPU count, and time limit.
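A quick way to review those directives before submitting is to list them with their line numbers. The helper below is a small sketch; `show_sbatch_directives` is a hypothetical name, and it relies only on `grep`.

```shell
# Sketch: print the #SBATCH directives in a batch script with their line numbers.
# show_sbatch_directives is a hypothetical helper, not part of the repository.
show_sbatch_directives() {
    grep -n '^#SBATCH' "$1"
}

# Example: show_sbatch_directives scripts/train_slurm.sbatch
```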
Option E — Kubernetes
kubectl create secret generic hf-secret --from-literal=token=<YOUR_HF_TOKEN>
kubectl apply -f k8s/training-job.yaml
kubectl logs -f job/vlash-train
W&B logging
wandb is pre-installed in the container — no additional dependencies needed.
To enable it:
- Set wandb.enable: true in your training config
- Add your API key to the job script or .env:
export WANDB_API_KEY=your_key # PBS / SLURM job script
# or in .env for Docker Compose:
WANDB_API_KEY=your_key
W&B streams loss, gradient norm, and learning rate every log_freq steps. If a run
crashes midway, the dashboard shows the last step it reached, which makes it a
useful starting point for diagnosing training failures.
Get your API key at wandb.ai/settings.
Debugging a failed run
| Symptom | Where to look |
|---|---|
| Job never starts | PBS: qstat -u $USER / SLURM: squeue -u $USER — check queue state |
| Job starts then exits immediately | logs/ job output file — look for missing env vars or container errors |
| CUDA error or OOM | Job log + see GPU Requirements → OOM |
| Training starts but loss is NaN | Enable W&B and check the loss curve — usually lr too high or bad data |
| Checkpoint not saved | Check $SCRATCH/outputs/ exists and has write permission |
| Model weights not downloading | Verify HF_TOKEN is set and model license is accepted (see Setup) |
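For the "Checkpoint not saved" row, the directory check can be scripted. This is a minimal sketch; `check_output_dir` is a hypothetical helper, and it tests only that the directory exists and is writable by the current user.

```shell
# Sketch: verify that an output directory exists and is writable.
# check_output_dir is a hypothetical helper, not part of the repository.
check_output_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir" >&2
        return 2
    fi
    echo "ok: $dir"
}

# Example: check_output_dir "$SCRATCH/outputs"
```

The distinct return codes (1 for missing, 2 for not writable) let a job script tell the two cases apart.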
Checkpoints
Checkpoints are saved to /scratch/outputs/<job_name>/checkpoints/ every save_freq steps.
The latest is symlinked at .../checkpoints/last.
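To see which checkpoint the `last` symlink currently points to, something like the following could be used. It is a sketch: `latest_checkpoint` is a hypothetical helper, and it simply prints the symlink's target as stored.

```shell
# Sketch: print the target of the "last" symlink in a checkpoints directory.
# latest_checkpoint is a hypothetical helper, not part of the repository.
latest_checkpoint() {
    ckpt_dir="$1"
    if [ ! -L "$ckpt_dir/last" ]; then
        echo "no 'last' symlink in $ckpt_dir" >&2
        return 1
    fi
    readlink "$ckpt_dir/last"
}

# Example: latest_checkpoint /scratch/outputs/<job_name>/checkpoints
```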
To push to HuggingFace Hub automatically after training: