Running iCluto on a SLURM Cluster

This guide explains how to install and run iCluto on clusters using the SLURM workload manager, focusing on robust self-supervised training (DINO).

Installation

Installation on SLURM Cluster (using Spack)

If your cluster uses Spack for package management, follow these steps to set up iCluto:

1. Get Spack

Clone Spack to your home directory (if not already present) and load its environment:

git clone -c feature.manyFiles=true https://github.com/spack/spack.git
source spack/share/spack/setup-env.sh
# check the installation
spack list

2. Install and Load Python 3.11

spack install python@3.11
spack load python@3.11

3. Install iCluto

Transfer the latest .tar.gz or .whl (v0.1.9) from your laptop to the cluster, then install it in a dedicated virtual environment:

# Example: Using the source distribution
tar -xzf icluto-0.1.9.tar.gz
cd icluto-0.1.9

# Create and activate virtual environment
python -m venv venv_icluto
source venv_icluto/bin/activate

# Install iCluto and its CLI training scripts
pip install .

Robust Training & Checkpointing

For long-running training tasks (such as DINO), iCluto provides a robust checkpointing system that saves the full training state: the teacher and student weights, the optimizer state (including momentum buffers), and the LR scheduler state.
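The exact on-disk format is internal to iCluto, but the idea can be sketched as follows. The keys and placeholder values below are illustrative assumptions, not iCluto's actual schema; in practice the weight entries would be PyTorch state dicts.

```python
import os
import pickle
import tempfile

# Hypothetical sketch: a checkpoint bundles ALL training state, not just
# model weights, so a resumed run continues exactly where it stopped.
checkpoint = {
    "epoch": 50,
    "student_state": {"backbone.weight": [0.1, 0.2]},   # placeholder, not real tensors
    "teacher_state": {"backbone.weight": [0.1, 0.2]},
    "optimizer_state": {"momentum_buffers": {}},
    "scheduler_state": {"last_epoch": 50},
}

path = os.path.join(tempfile.mkdtemp(), "dino_model_epoch50.pth")
with open(path, "wb") as f:
    pickle.dump(checkpoint, f)

# On resume, every component is restored together
with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored["scheduler_state"]["last_epoch"] == restored["epoch"]
```

Because the optimizer and scheduler are restored along with the weights, a resumed run behaves as if it had never been interrupted.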

Continuing Training

You can resume training from the last saved state using the --resume flag:

# Automatically find and resume from the latest checkpoint in the output folder
icluto-train-dino data/traces.npy --resume auto

# Resume from a specific checkpoint file
icluto-train-dino data/traces.npy --resume out/dino/run1/weights/dino_model_epoch50.pth
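With --resume auto, the script looks for the most recent checkpoint in the output folder. A minimal sketch of how such discovery could work, assuming the dino_model_epochNN.pth naming shown above (iCluto's internal logic may differ):

```python
import os
import re
import tempfile

def latest_checkpoint(weights_dir):
    """Return the checkpoint file with the highest epoch number, or None."""
    pattern = re.compile(r"dino_model_epoch(\d+)\.pth$")
    best, best_epoch = None, -1
    for name in os.listdir(weights_dir):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_epoch:
            best, best_epoch = name, int(m.group(1))
    return best

# Demo against a throwaway directory
d = tempfile.mkdtemp()
for epoch in (10, 50, 35):
    open(os.path.join(d, f"dino_model_epoch{epoch}.pth"), "w").close()

print(latest_checkpoint(d))  # dino_model_epoch50.pth
```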

SLURM Signal Handling

The training scripts are designed to catch the SIGUSR1 signal, which SLURM can be configured to send shortly before the job's walltime is reached (via the --signal sbatch directive shown below). When the signal arrives, the script exits gracefully, allowing a shell trap in the submission script to handle job re-submission.


Automatic Job Resubmission

To handle job timeouts and pre-emptions automatically, use a submission script that registers a resubmit function with trap resubmit EXIT, as in the example below.

The STOP File Mechanism

To manually cancel a re-submitting job, touch STOP in the output root directory. The script will detect this file on exit and prevent further auto-resubmissions.

Example: train_dino_resubmit.sbatch

#!/bin/bash
#SBATCH --job-name=dino_train
#SBATCH --time=4:00:00
#SBATCH --signal=USR1@60      # Send SIGUSR1 60 s before the walltime is reached

# Paths shared by the training command and the resubmit logic
OUTPUT_DIR="out/dino"
RUN_NAME="run1"

function resubmit {
    # .finished is created by the python script upon reaching the final epoch
    FINISHED_MARKER="$OUTPUT_DIR/$RUN_NAME/weights/dino_model.finished"
    STOP_FILE="$OUTPUT_DIR/STOP"

    if [ -f "$FINISHED_MARKER" ]; then
        echo "Training finished successfully."
    elif [ -f "$STOP_FILE" ]; then
        echo "Manual STOP detected. Cancelling auto-resubmit."
        rm "$STOP_FILE"
    else
        echo "Job not finished. Resubmitting..."
        sbatch --export=ALL "$0"
    fi
}

trap resubmit EXIT
set -e

# Load environment
source $HOME/spack/share/spack/setup-env.sh
spack load python@3.11
source $HOME/icluto_staging/icluto-0.1.9/venv_icluto/bin/activate

# Run training
icluto-train-dino data/traces.npy --output_dir "$OUTPUT_DIR" --resume auto

Job Arrays & Sweeps

For hyperparameter sweeps (e.g., testing different patch sizes), use SLURM Job Arrays. See scripts/submit_sweep.sh for an example of how to iterate through a grid and launch multiple re-submitting tasks efficiently.
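Inside each array task, the SLURM_ARRAY_TASK_ID environment variable selects one point of the grid. A minimal sketch of that mapping; the patch-size and learning-rate values here are illustrative assumptions, not prescribed settings:

```python
import itertools
import os

# Hypothetical sweep grid: 3 patch sizes x 2 learning rates = 6 array tasks,
# submitted with e.g.  sbatch --array=0-5 ...
patch_sizes = [8, 16, 32]
learning_rates = [1e-4, 5e-4]
grid = list(itertools.product(patch_sizes, learning_rates))

# SLURM sets SLURM_ARRAY_TASK_ID per task; default to 0 for local testing
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
patch_size, lr = grid[task_id]
print(f"task {task_id}: patch_size={patch_size}, lr={lr}")
```

Each array task then passes its (patch_size, lr) pair to the training command, and each can use the same resubmit trap as a single job.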


Monitoring with TensorBoard

To monitor your training progress live while it runs on the cluster:

  1. On the cluster (e.g. the login node): tensorboard --logdir out/logs/ --port 6006
  2. On your laptop (SSH port forwarding): ssh -L 6006:localhost:6006 your-user@cluster-login-node
  3. Open http://localhost:6006 in your local browser.