Revisiting Small Batch Training for Deep Neural Networks

The team at Graphcore Research has recently been considering mini-batch stochastic gradient optimization of modern deep network architectures, comparing the test performance for different batch sizes. Our experiments show that small batch sizes produce the best results.

We have found that increasing the batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. Smaller batch sizes also provide more up-to-date gradient calculations, which give more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between 2 and 32. This contrasts with recent work advocating mini-batch sizes in the thousands, motivated by the desire to expose more data parallelism and so reduce training time on today's hardware.

The training of modern deep neural networks is based on mini-batch Stochastic Gradient Descent (SGD) optimization, where each weight update relies on a small subset of training examples. The recent drive to employ progressively larger batch sizes is motivated by the desire to improve the parallelism of SGD, both to increase the efficiency on today's processors and to allow distributed implementation across a larger number of physical processors. On the other hand, the use of small batch sizes has been shown to improve generalization performance and optimization convergence (LeCun et al., 2012; Keskar et al., 2016) and requires a significantly smaller memory footprint, but needs a different type of processor to sustain full-speed training.

We have investigated the training dynamics and generalization performance of small batch training for different scenarios. The main contributions of our work are the following:

  • We have produced an extensive set of experimental results which highlight that using small batch sizes significantly improves training stability. This results in a wider range of learning rates that provide stable convergence, while using larger batch sizes often reduces the usable range to the point that the optimal learning rate cannot be used.
  • The results confirm that using small batch sizes achieves the best generalization performance, for a given computation cost. In all cases, the best results have been obtained with batch sizes of 32 or smaller. Often mini-batch sizes as small as 2 or 4 deliver optimal results.

Our results show that a new type of processor, able to work efficiently on small mini-batch sizes, would make it possible to train better neural network models, and to train them faster.

Stochastic Gradient Optimization

The SGD optimization updates the network parameters $\boldsymbol{\theta}$ by computing the gradient of the loss $L(\boldsymbol{\theta})$ for a mini-batch $\mathcal{B}$ of $m$ training examples, resulting in the weight update rule

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \; \frac{1}{m} \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}_k) \, ,$$

where $\eta$ denotes the learning rate.

For a given batch size $m$ the expected value of the weight update per training example (i.e., per gradient calculation $\nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta})$) is proportional to $\eta/m$. This implies that a linear increase of the learning rate $\eta$ with the batch size $m$ is required to keep the mean weight update per training example constant.
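As a concrete illustration, the following is a minimal NumPy sketch (not from the paper; the toy quadratic loss and all names are ours) of the averaged-gradient update above. Because the gradient is averaged over the batch, each example contributes an update proportional to $\eta/m$, which is exactly why the linear scaling rule increases $\eta$ in proportion to $m$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: per-example loss L_i(theta) = (theta - x_i)^2 / 2,
# so the per-example gradient is simply (theta - x_i).
x = rng.normal(loc=3.0, scale=1.0, size=1024)

def sgd_epoch(theta, eta, m):
    """One pass over the data with mini-batch size m, using the mean of the
    per-example gradients, as in the update rule above."""
    for start in range(0, len(x), m):
        batch = x[start:start + m]
        grad = np.mean(theta - batch)      # (1/m) * sum_i grad L_i(theta)
        theta = theta - eta * grad
    return theta

# Linear scaling rule: increasing m by 8x and eta by 8x keeps the mean update
# per training example constant, so the two runs make similar progress per epoch.
print(sgd_epoch(theta=0.0, eta=0.01, m=8))
print(sgd_epoch(theta=0.0, eta=0.08, m=64))
```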

This is achieved by the linear scaling rule, which has recently been widely adopted (e.g., Goyal et al., 2017). Here we suggest that, as discussed by Wilson & Martinez (2003), it is clearer to define the SGD parameter update rule in terms of a fixed base learning rate $\tilde{\eta} = \eta / m$, which corresponds to using the sum instead of the average of the local gradients

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \tilde{\eta} \; \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}_k) \, .$$

In this case, if the batch size $m$ is increased, the mean SGD weight update per training example is kept constant by simply maintaining a constant learning rate $\tilde{\eta}$.
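The equivalence of the two formulations is easy to verify numerically; here is a small sketch with made-up gradient values. Summing the per-example gradients with a fixed base learning rate $\tilde{\eta}$ produces exactly the same update as averaging them with $\eta = \tilde{\eta} \cdot m$.

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(32, 10))    # per-example gradients for a batch of m = 32
m = grads.shape[0]
eta_base = 1e-3                      # base learning rate (eta tilde)

update_sum  = eta_base * grads.sum(axis=0)          # sum form, learning rate eta_base
update_mean = (eta_base * m) * grads.mean(axis=0)   # mean form, learning rate eta = eta_base * m

print(np.allclose(update_sum, update_mean))         # True: the two updates coincide
```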

At the same time, the variance of the parameter update scales linearly with the quantity $\eta^2/m = \tilde{\eta} ^2 \cdot m \, $ (Hoffer et al., 2017). Therefore, keeping the base learning rate $\tilde{\eta}$ constant implies a linear increase of the variance with the batch size $m$.
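This scaling can also be checked with a small Monte-Carlo sketch, using an idealised noise model of our own (i.i.d. unit-variance per-example gradients): at a fixed base learning rate $\tilde{\eta}$, the empirical variance of the summed update grows linearly with $m$.

```python
import numpy as np

rng = np.random.default_rng(2)
eta_base = 1e-3      # fixed base learning rate (eta tilde)

for m in (4, 16, 64, 256):
    # i.i.d. per-example gradients with unit variance, as a stand-in for gradient noise
    grads = rng.normal(size=(100_000, m))
    updates = eta_base * grads.sum(axis=1)    # sum-of-gradients update for batch size m
    print(m, updates.var())                   # empirical variance ~ eta_base**2 * m
```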

Benefits of Small Batch Training

When comparing the SGD update for a batch size $m$ with the update for a larger batch size $n \cdot m$, the crucial difference is that with the larger batch size all the $n \cdot m$ gradient calculations are performed with respect to the original point $\boldsymbol{\theta}_k$ in the parameter space. As shown in the figure below, with a small batch size $m$ and the same computation cost, the gradients for $n$ consecutive update steps are instead calculated with respect to the updated points $\boldsymbol{\theta}_{k+j}$, for $j = 1, ..., n - 1$.

Therefore, under the assumption of constant base learning rate $\tilde{\eta}$, large batch training can be considered to be an approximation of small batch methods that trades increased parallelism for stale gradients (Wilson & Martinez, 2003).

[Figure: SGD parameter updates for a batch size $m$ compared with a larger batch size $n \cdot m$, at the same computation cost.]
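The effect of these stale gradients can be seen on a toy quadratic problem (a sketch of our own, not an experiment from the paper): one large-batch step evaluates every gradient at the starting point, while $n$ small-batch steps of the same total computation cost re-evaluate the gradient at each intermediate point, and remain stable at a base learning rate where the single large-batch step already overshoots.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=3.0, scale=1.0, size=64)   # per-example gradient of L_i is (theta - x_i)
eta_base = 0.03                               # fixed base learning rate (eta tilde)

# One update with the large batch n*m = 64: all 64 gradients evaluated at theta = 0.
theta_large = 0.0
theta_large -= eta_base * np.sum(theta_large - x)

# n = 8 consecutive updates with batch size m = 8: each batch's gradients are
# evaluated at the most recent theta, so the gradient information is never stale.
theta_small = 0.0
for batch in x.reshape(8, 8):
    theta_small -= eta_base * np.sum(theta_small - batch)

# The optimum is x.mean(); the single large-batch step overshoots it badly,
# while the sequence of small-batch steps approaches it smoothly.
print(theta_large, theta_small, x.mean())
```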

Small batch sizes provide a better optimization path

The CIFAR-10 test performance obtained for a reduced AlexNet model over a fixed number of epochs shows that using smaller batches gives a clear performance advantage. For the same base learning rate $\tilde{\eta}$, reducing the batch size delivers improved test accuracy. Smaller batches also allow the widest range of learning rates that provide stable convergence.

[Figures: CIFAR-10 test accuracy of the reduced AlexNet model for different batch sizes and base learning rates $\tilde{\eta}$.]

Modern deep networks commonly employ Batch Normalization (Ioffe & Szegedy, 2015), which has been shown to significantly improve training performance. With Batch Normalization, the activations of each layer are normalized per feature, using the mean and variance estimated from a batch of examples. The performance of Batch Normalization for very small batch sizes is typically affected by the reduced sample size available for estimating the batch mean and variance. However, the collected data shows best performance with batch sizes smaller than previously reported.
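For reference, here is a minimal sketch of the batch statistics Batch Normalization relies on (NumPy, with made-up activations, and omitting the learnable scale and shift): with a very small batch, the per-feature mean and variance are estimated from only $m$ samples, so the normalization statistics themselves become noisy.

```python
import numpy as np

rng = np.random.default_rng(4)

def batch_norm(acts, eps=1e-5):
    """Normalize each feature with the mean and variance estimated from the batch
    (the learnable scale and shift of Batch Normalization are omitted here)."""
    mean = acts.mean(axis=0)                  # per-feature batch mean
    var = acts.var(axis=0)                    # per-feature batch variance
    return (acts - mean) / np.sqrt(var + eps)

# Activations are drawn with true mean 0 and variance 1 per feature; smaller
# batches estimate these statistics from fewer samples, so the estimates (and
# hence the normalization applied above) become noisier.
for m in (2, 8, 32, 128):
    acts = rng.normal(size=(m, 16))           # m examples, 16 features
    normed = batch_norm(acts)                 # normalization used in the forward pass
    print(m, acts.mean(axis=0).std())         # scatter of the per-feature batch-mean estimate
```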

The following figure shows the CIFAR-100 performance for ResNet-32, with Batch Normalization, for different values of batch size $m$ and base learning rate $\tilde{\eta}$. The results again show a significant performance degradation for increasing values of the batch size, with the best results obtained for batch sizes $m = 4$ or $m = 8$. The results also indicate a clear optimum value of the base learning rate, which is only achievable for batch sizes $m = 8$ or smaller.

[Figures: CIFAR-100 test accuracy of ResNet-32 with Batch Normalization for different batch sizes $m$ and base learning rates $\tilde{\eta}$.]

As summarized in the following figure, increasing the batch size progressively reduces the range of learning rates that provide stable convergence. This demonstrates how the increased variance in the weight update associated with larger batch sizes can affect the robustness and stability of training. The results clearly indicate that small batches are required both to achieve the best test performance and to allow easier, more robust optimization.

[Figure: range of base learning rates providing stable convergence, as a function of batch size.]

Different Batch Sizes for Weight Update and Batch Normalization

In the following figure, we consider the effect of using small sub-batches for Batch Normalization, and larger batches for SGD. This is common practice for the case of data-parallel distributed processing, where Batch Normalization is often implemented independently on each individual processor, while the gradients for the SGD weight updates are aggregated across all workers.

The CIFAR-100 results show a general performance improvement by reducing the overall batch size for the SGD weight updates. We note that the best test accuracy for a given overall SGD batch size is consistently obtained when even smaller batches are used for Batch Normalization. This evidence suggests that to achieve the best performance both a modest overall batch size for SGD and a small batch size for Batch Normalization are required.

[Figure: CIFAR-100 test accuracy for different combinations of SGD batch size and Batch Normalization batch size.]
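A sketch of how this splitting might be implemented (forward pass only, with hypothetical shapes and names of our own): the Batch Normalization statistics are computed over small sub-batches, as if each were processed on a separate worker, while the loss, and therefore the gradient for the SGD weight update, is still accumulated over the full batch.

```python
import numpy as np

rng = np.random.default_rng(5)

def sub_batch_norm(acts, sub_batch_size, eps=1e-5):
    """Normalize each sub-batch with its own mean and variance, as if it were
    processed on a separate worker, then re-assemble the full batch."""
    n, features = acts.shape
    acts = acts.reshape(n // sub_batch_size, sub_batch_size, features)
    mean = acts.mean(axis=1, keepdims=True)   # statistics per sub-batch of examples
    var = acts.var(axis=1, keepdims=True)
    return ((acts - mean) / np.sqrt(var + eps)).reshape(n, features)

acts = rng.normal(size=(64, 16))              # overall SGD batch of 64 examples
normed = sub_batch_norm(acts, sub_batch_size=8)

# The SGD weight update would still aggregate the gradients over all 64 examples;
# only the normalization statistics come from the smaller sub-batches of 8.
print(normed.shape)
```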

Why it matters

Using small batch sizes has been seen to achieve the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. The results also highlight the optimization difficulties associated with large batch sizes. Overall, the experimental results support the broad conclusion that using small batch sizes for training provides benefits both in terms of the range of learning rates that provide stable convergence and the test performance for a given number of epochs.

While we are not the first to conclude that smaller mini-batch sizes give better generalization performance, current practice is geared towards ever larger batch sizes, because today's hardware forces a trade-off: accepting less accurate results in exchange for enough synthesized parallelism to fill the wide vector data-paths of today's processors and to hide their long latencies to model data stored off-chip in DRAM.

With the arrival of new hardware specifically designed for machine intelligence, like Graphcore’s Intelligence Processing Unit (IPU), it’s time to rethink conventional wisdom on optimal batch size. With the IPU you will be able to run training efficiently even with small batches, and hence achieve both increased accuracy and faster training. In addition, because the IPU holds the entire model inside the processor, you gain an additional speed-up by virtue of not having to access external memory continuously. Our benchmark performance results highlight the faster training times that can be achieved.

You can read the full paper here: https://arxiv.org/abs/1804.07612.


