Stable Diffusion fine-tuning terminology: batches, epochs, steps, timesteps, and gradient accumulation steps.

Jess Ferments
2 min read · May 11, 2024

In the context of fine-tuning Stable Diffusion models, you will come across many terms that are easy to confuse as a beginner. What is the difference between a step, a timestep, and a gradient accumulation step? Each of these common terms is described below, along with how they relate to one another and what role they play in training Stable Diffusion models:

  1. Steps: A step is a single iteration in which the model’s parameters are updated, which normally happens after processing one batch of data (unless gradient accumulation is used; see item 5). If your training dataset consists of 1000 samples and you are using a batch size of 100, it takes 10 steps to go through the entire dataset once. The first sketch after this list shows this bookkeeping in a toy training loop.
  2. Epochs: An epoch represents a full pass through the entire training dataset. Using the earlier example, once all 1000 samples have been processed in batches of 100, one epoch is completed. Typically, models are trained for multiple epochs to allow the learning algorithm multiple opportunities to update parameters and refine its understanding of the data.
  3. Batches: A batch is a subset of the training dataset that is used to perform one update of the model’s parameters. The size of a batch, often just called batch size, is a crucial hyperparameter in training neural networks. It affects the stability of the learning process, the quality of the model, and the computational efficiency. Larger batch sizes provide more stable and accurate gradient estimates but require more memory and computation.
  4. Timesteps: A diffusion model starts from a known distribution (such as Gaussian noise) and gradually denoises it over a series of timesteps to generate the final output. Each timestep corresponds to a stage in this reverse diffusion process: the model learns to reconstruct the data from noise step by step, with each timestep slightly less noisy than the one before, until the original data distribution is reached. Timesteps are internal to the diffusion process and unrelated to training steps; during fine-tuning, a timestep is sampled at random for each training example, as the second sketch after this list shows.
  5. Gradient Accumulation Steps: Gradient accumulation is a technique for situations where the effective batch size you want exceeds what fits in available GPU memory. Instead of updating the model weights after every batch, gradients from several batches are accumulated (summed), and only then is the model updated; the loss is typically divided by the number of accumulation steps so the result approximates the average gradient over the larger effective batch. For example, if you want an effective batch size of 800 but your system can only handle 100 samples per batch, you can accumulate gradients over 8 batches (800 / 100) before applying a single weight update, as the final sketch after this list shows.
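To make the bookkeeping concrete, here is a minimal sketch of a training loop. It uses a toy dataset and a plain linear layer as a stand-in for the actual Stable Diffusion components (the real fine-tuning loop trains a UNet on latents and text embeddings), so treat it as an illustration of how batches, steps, and epochs relate, not as real training code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset of 1000 samples, just to illustrate the bookkeeping.
dataset = TensorDataset(torch.randn(1000, 4))
batch_size = 100
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

model = torch.nn.Linear(4, 4)          # stand-in for the real denoising model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

num_epochs = 3
global_step = 0

for epoch in range(num_epochs):             # one epoch = one full pass over the data
    for (batch,) in loader:                 # one batch = 100 samples here
        loss = model(batch).pow(2).mean()   # placeholder loss
        loss.backward()
        optimizer.step()                    # one step = one parameter update
        optimizer.zero_grad()
        global_step += 1

# 1000 samples / batch size 100 = 10 steps per epoch, 30 steps in total here.
print(global_step)
```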
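Timesteps are easiest to see in the training objective itself. The sketch below uses a simplified DDPM-style linear noise schedule (real Stable Diffusion pipelines use a scheduler object for this), and shows how each example in a batch gets its own randomly sampled timestep and a matching amount of noise; the model is then trained to predict that noise:

```python
import torch

num_train_timesteps = 1000

# Simplified DDPM-style linear beta schedule; the real scheduler may differ.
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, noise, t):
    """Produce the noisy sample x_t for each example's sampled timestep t."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# During training, each latent in the batch gets its own random timestep.
latents = torch.randn(8, 4, 64, 64)          # hypothetical batch of latents
noise = torch.randn_like(latents)
t = torch.randint(0, num_train_timesteps, (latents.shape[0],))
noisy_latents = add_noise(latents, noise, t)
# The model would then be trained to predict `noise` from (noisy_latents, t).
```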
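Finally, a minimal sketch of gradient accumulation, using the same toy setup as the first sketch. Training scripts usually expose this pattern as a single gradient-accumulation-steps setting rather than making you write the loop yourself:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same toy setup as the first sketch: 1000 samples, micro-batches of 100.
loader = DataLoader(TensorDataset(torch.randn(1000, 4)), batch_size=100, shuffle=True)
model = torch.nn.Linear(4, 4)                  # stand-in for the real denoising model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accumulation_steps = 8                         # 8 micro-batches of 100 -> effective batch of 800
optimizer.zero_grad()

for i, (batch,) in enumerate(loader):
    loss = model(batch).pow(2).mean()          # placeholder loss
    (loss / accumulation_steps).backward()     # scale so summed gradients average out
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                       # one weight update per 8 micro-batches
        optimizer.zero_grad()
# Any leftover accumulated gradients at the end of an epoch are typically
# either applied as a final partial update or discarded by the training framework.
```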
