Guides
Selecting a GPU backend
Tsunami supports both CPU and GPU devices. To select a device, use the `accelerator` and `devices` keyword arguments in the `Trainer` constructor.
trainer = Trainer(accelerator = :auto) # default, selects CPU or GPU depending on availability
trainer = Trainer(accelerator = :gpu) # forces selection to GPU, errors if no GPU is available
Currently supported accelerators are `:auto`, `:gpu`, and `:cpu`. See the `Trainer` documentation for more details.
By default, Tsunami uses the first available GPU, falling back to the CPU if no GPU is present. To select a specific GPU, use the `devices` keyword argument:
trainer = Trainer(devices = [1])
Devices are indexed starting from 1, following the same convention as the `MLDataDevices.gpu_device` function used by Flux.
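For reference, here is a minimal sketch of the same selection made directly with MLDataDevices, assuming CUDA.jl is installed and a second GPU is available (the device id and the array below are only illustrations):

using CUDA              # load a GPU backend so MLDataDevices can detect the GPUs
using MLDataDevices

device = gpu_device(2)              # second GPU, same 1-based indexing as Trainer(devices = [2])
x = device(rand(Float32, 3, 3))     # move an array to the selected device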
Selecting an automatic differentiation engine
Zygote is the default Automatic Differentiation (AD) engine in Tsunami, used for computing gradients during training. Enzyme is an alternative AD engine that can sometimes provide faster performance and differentiate through mutating functions.
To select an AD engine, use the `autodiff` keyword argument in the `Trainer` constructor:
trainer = Trainer(autodiff = :enzyme) # options are :zygote (default) and :enzyme
Gradient accumulation
Gradient accumulation is a technique that allows you to simulate larger batch sizes by accumulating gradients over multiple batches. This is useful when you want to use a large batch size but your GPU does not have enough memory.
Optimisers.jl supports gradient accumulation with the `AccumGrad` rule:
AccumGrad(n::Int)
A rule constructed `OptimiserChain(AccumGrad(n), Rule())` will accumulate for `n` steps, before applying Rule to the mean of these `n` gradients.
This is useful for training with effective batch sizes too large for the available memory. Instead of computing the gradient for batch size `b` at once, compute it for size `b/n` and accumulate `n` such gradients.
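To make the behaviour concrete, here is a minimal standalone sketch with plain Optimisers.jl (the parameters, gradients, and accumulation length 2 are toy values):

using Optimisers

w = [1.0, 2.0]                                        # toy parameters
opt = OptimiserChain(AccumGrad(2), Descent(0.1))      # accumulate 2 gradients, then apply Descent
state = Optimisers.setup(opt, w)

state, w = Optimisers.update!(state, w, [0.1, 0.1])   # step 1: gradient is stored, w is unchanged
state, w = Optimisers.update!(state, w, [0.3, 0.3])   # step 2: Descent acts on the mean gradient [0.2, 0.2]
# now w ≈ [0.98, 1.98]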
`AccumGrad` can be easily integrated into Tsunami's `configure_optimisers`:
using Optimisers

function Tsunami.configure_optimisers(model::Model, trainer)
    # Accumulate 5 gradients, then apply AdamW to their mean
    opt = OptimiserChain(AccumGrad(5), AdamW(1e-3))
    opt_state = Optimisers.setup(opt, model)
    return opt_state
end
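With `AccumGrad(5)` as above, the AdamW update is applied once every 5 training steps, using the mean of those 5 gradients, so the effective batch size is 5 times the batch size of the train dataloader.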
Gradient clipping
Gradient clipping is a technique that allows you to limit the range or the norm of the gradients. This is useful to prevent exploding gradients and improve training stability.
Optimisers.jl supports gradient clipping with the `ClipGrad` and `ClipNorm` rules:
ClipGrad(δ = 10f0)
Restricts every gradient component to obey -δ ≤ dx[i] ≤ δ.
ClipNorm(ω = 10f0, p = 2; throw = true)
Scales any gradient array for which norm(dx, p) > ω to stay at this threshold (unless p==0).
Throws an error if the norm is infinite or NaN, which you can turn off with throw = false.
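To see the two rules in isolation, here is a minimal sketch with plain Optimisers.jl (the gradient [3.0, 4.0] and the thresholds are toy values):

using Optimisers

g = [3.0, 4.0]                                        # toy gradient, norm(g) == 5

o = Optimisers.setup(OptimiserChain(ClipNorm(1.0), Descent(1.0)), zeros(2))
_, x = Optimisers.update!(o, zeros(2), g)
# x ≈ [-0.6, -0.8]: g was rescaled to norm 1 before the Descent step

o = Optimisers.setup(OptimiserChain(ClipGrad(0.5), Descent(1.0)), zeros(2))
_, x = Optimisers.update!(o, zeros(2), g)
# x ≈ [-0.5, -0.5]: each component was clamped to ±0.5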
Gradient clipping can be easily integrated into Tsunami's `configure_optimisers`:
using Optimisers

function Tsunami.configure_optimisers(model::Model, trainer)
    # Clip each gradient component to [-0.1, 0.1] before the AdamW update
    opt = OptimiserChain(ClipGrad(0.1), AdamW(1e-3))
    opt_state = Optimisers.setup(opt, model)
    return opt_state
end
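Note that `OptimiserChain` applies its rules from left to right, so the clipping rule is placed before AdamW in order to act on the raw gradients.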