Revisiting Small Batch Training for Deep Neural Networks

The team at Graphcore Research has recently been considering mini-batch stochastic gradient optimization of modern deep network architectures, comparing the test performance for different batch sizes. Our experiments show that small batch sizes produce the best results.

We have found that increasing the batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. Smaller batch sizes also provide more up-to-date gradient calculations, which give more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between 2 and 32. This contrasts with recent work advocating mini-batch sizes in the thousands, motivated by the desire to expose more data parallelism and so reduce training time on today's hardware.

The training of modern deep neural networks is based on mini-batch Stochastic Gradient Descent (SGD) optimization, where each weight update relies on a small subset of training examples. The recent drive to employ progressively larger batch sizes is motivated by the desire to improve the parallelism of SGD, both to increase the efficiency on today's processors and to allow distributed implementation across a larger number of physical processors. On the other hand, the use of small batch sizes has been shown to improve generalization performance and optimization convergence (LeCun et al., 2012; Keskar et al., 2016) and requires a significantly smaller memory footprint, but needs a different type of processor to sustain full-speed training.

We have investigated the training dynamics and generalization performance of small batch training for different scenarios. The main contributions of our work are the following:

  • We have produced an extensive set of experimental results which highlight that using small batch sizes significantly improves training stability. This results in a wider range of learning rates that provide stable convergence, while using larger batch sizes often reduces the usable range to the point that the optimal learning rate cannot be used.
  • The results confirm that using small batch sizes achieves the best generalization performance, for a given computation cost. In all cases, the best results have been obtained with batch sizes of 32 or smaller. Often mini-batch sizes as small as 2 or 4 deliver optimal results.

Our results show that a new type of processor, able to work efficiently on small mini-batch sizes, would make it possible to train better neural network models, and to train them faster.

Stochastic Gradient Optimization

The SGD optimization updates the network parameters $\boldsymbol{\theta}$ by computing the gradient of the loss $L(\boldsymbol{\theta})$ for a mini-batch $\mathcal{B}$ of $m$ training examples, resulting in the weight update rule

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \; \frac{1}{m} \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}_k) \, ,$$

where $\eta$ denotes the learning rate.

For a given batch size $m$ the expected value of the weight update per training example (i.e., per gradient calculation $\nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta})$) is proportional to $\eta/m$. This implies that a linear increase of the learning rate $\eta$ with the batch size $m$ is required to keep the mean weight update per training example constant.
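As a concrete illustration, the following is a minimal NumPy sketch (not from the paper; the toy quadratic loss and all names are ours) of the averaged-gradient update above. Because the gradient is averaged over the batch, each example contributes an update proportional to $\eta/m$, which is exactly why the linear scaling rule increases $\eta$ in proportion to $m$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: per-example loss L_i(theta) = (theta - x_i)^2 / 2,
# so the per-example gradient is simply (theta - x_i).
x = rng.normal(loc=3.0, scale=1.0, size=1024)

def sgd_epoch(theta, eta, m):
    """One pass over the data with mini-batch size m, using the mean of the
    per-example gradients, as in the update rule above."""
    for start in range(0, len(x), m):
        batch = x[start:start + m]
        grad = np.mean(theta - batch)      # (1/m) * sum_i grad L_i(theta)
        theta = theta - eta * grad
    return theta

# Linear scaling rule: increasing m by 8x and eta by 8x keeps the mean update
# per training example constant, so the two runs make similar progress per epoch.
print(sgd_epoch(theta=0.0, eta=0.01, m=8))
print(sgd_epoch(theta=0.0, eta=0.08, m=64))
```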

This is achieved by the linear scaling rule, which has recently been widely adopted (e.g., Goyal et al., 2017). Here we suggest that, as discussed by Wilson & Martinez (2003), it is clearer to define the SGD parameter update rule in terms of a fixed base learning rate $\tilde{\eta} = \eta / m$, which corresponds to using the sum instead of the average of the local gradients

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \tilde{\eta} \; \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} L_i(\boldsymbol{\theta}_k) \, .$$

In this case, if the batch size $m$ is increased, the mean SGD weight update per training example is kept constant by simply maintaining a constant learning rate $\tilde{\eta}$.
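The equivalence of the two formulations is easy to verify numerically; here is a small sketch with made-up gradient values. Summing the per-example gradients with a fixed base learning rate $\tilde{\eta}$ produces exactly the same update as averaging them with $\eta = \tilde{\eta} \cdot m$.

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(32, 10))    # per-example gradients for a batch of m = 32
m = grads.shape[0]
eta_base = 1e-3                      # base learning rate (eta tilde)

update_sum  = eta_base * grads.sum(axis=0)          # sum form, learning rate eta_base
update_mean = (eta_base * m) * grads.mean(axis=0)   # mean form, learning rate eta = eta_base * m

print(np.allclose(update_sum, update_mean))         # True: the two updates coincide
```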

At the same time, the variance of the parameter update scales linearly with the quantity $\eta^2/m = \tilde{\eta} ^2 \cdot m \, $ (Hoffer et al., 2017). Therefore, keeping the base learning rate $\tilde{\eta}$ constant implies a linear increase of the variance with the batch size $m$.
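This scaling can also be checked with a small Monte-Carlo sketch, using an idealised noise model of our own (i.i.d. unit-variance per-example gradients): at a fixed base learning rate $\tilde{\eta}$, the empirical variance of the summed update grows linearly with $m$.

```python
import numpy as np

rng = np.random.default_rng(2)
eta_base = 1e-3      # fixed base learning rate (eta tilde)

for m in (4, 16, 64, 256):
    # i.i.d. per-example gradients with unit variance, as a stand-in for gradient noise
    grads = rng.normal(size=(100_000, m))
    updates = eta_base * grads.sum(axis=1)    # sum-of-gradients update for batch size m
    print(m, updates.var())                   # empirical variance ~ eta_base**2 * m
```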

Benefits of Small Batch Training

When comparing the SGD update for a batch size $m$ with the update for a larger batch size $n \cdot m$, the crucial difference is that with the larger batch size all the $n \cdot m$ gradient calculations are performed with respect to the original point $\boldsymbol{\theta}_k$ in the parameter space. As shown in the figure below, with a small batch size $m$ and the same computation cost, the gradients for $n$ consecutive update steps are instead calculated with respect to the updated points $\boldsymbol{\theta}_{k+j}$, for $j = 1, ..., n - 1$.

Therefore, under the assumption of constant base learning rate $\tilde{\eta}$, large batch training can be considered to be an approximation of small batch methods that trades increased parallelism for stale gradients (Wilson & Martinez, 2003).

[Figure: SGD parameter updates for a batch size $m$ compared with a larger batch size $n \cdot m$, at the same computation cost.]
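The effect of these stale gradients can be seen on a toy quadratic problem (a sketch of our own, not an experiment from the paper): one large-batch step evaluates every gradient at the starting point, while $n$ small-batch steps of the same total computation cost re-evaluate the gradient at each intermediate point, and remain stable at a base learning rate where the single large-batch step already overshoots.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=3.0, scale=1.0, size=64)   # per-example gradient of L_i is (theta - x_i)
eta_base = 0.03                               # fixed base learning rate (eta tilde)

# One update with the large batch n*m = 64: all 64 gradients evaluated at theta = 0.
theta_large = 0.0
theta_large -= eta_base * np.sum(theta_large - x)

# n = 8 consecutive updates with batch size m = 8: each batch's gradients are
# evaluated at the most recent theta, so the gradient information is never stale.
theta_small = 0.0
for batch in x.reshape(8, 8):
    theta_small -= eta_base * np.sum(theta_small - batch)

# The optimum is x.mean(); the single large-batch step overshoots it badly,
# while the sequence of small-batch steps approaches it smoothly.
print(theta_large, theta_small, x.mean())
```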

Small batch sizes provide a better optimization path

The CIFAR-10 test performance obtained for a reduced AlexNet model over a fixed number of epochs shows that using smaller batches gives a clear performance advantage. For the same base learning rate $\tilde{\eta}$, reducing the batch size delivers improved test accuracy. Smaller batches also allow the widest range of learning rates that provide stable convergence.

[Figures: CIFAR-10 test accuracy of the reduced AlexNet model for different batch sizes and base learning rates $\tilde{\eta}$.]

Modern deep networks commonly employ Batch Normalization (Ioffe & Szegedy, 2015), which has been shown to significantly improve training performance. With Batch Normalization, the activations of each layer are normalized per feature, using the mean and variance estimated from a batch of examples. The performance of Batch Normalization for very small batch sizes is typically affected by the reduced sample size available for estimating the batch mean and variance. However, the collected data shows best performance with batch sizes smaller than previously reported.
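For reference, here is a minimal sketch of the batch statistics Batch Normalization relies on (NumPy, with made-up activations, and omitting the learnable scale and shift): with a very small batch, the per-feature mean and variance are estimated from only $m$ samples, so the normalization statistics themselves become noisy.

```python
import numpy as np

rng = np.random.default_rng(4)

def batch_norm(acts, eps=1e-5):
    """Normalize each feature with the mean and variance estimated from the batch
    (the learnable scale and shift of Batch Normalization are omitted here)."""
    mean = acts.mean(axis=0)                  # per-feature batch mean
    var = acts.var(axis=0)                    # per-feature batch variance
    return (acts - mean) / np.sqrt(var + eps)

# Activations are drawn with true mean 0 and variance 1 per feature; smaller
# batches estimate these statistics from fewer samples, so the estimates (and
# hence the normalization applied above) become noisier.
for m in (2, 8, 32, 128):
    acts = rng.normal(size=(m, 16))           # m examples, 16 features
    normed = batch_norm(acts)                 # normalization used in the forward pass
    print(m, acts.mean(axis=0).std())         # scatter of the per-feature batch-mean estimate
```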

The following figure shows the CIFAR-100 performance for ResNet-32, with Batch Normalization, for different values of batch size $m$ and base learning rate $\tilde{\eta}$. The results again show a significant performance degradation for increasing values of the batch size, with the best results obtained for batch sizes $m = 4$ or $m = 8$. The results also indicate a clear optimum value of the base learning rate, which is only achievable for batch sizes $m = 8$ or smaller.

[Figures: CIFAR-100 test accuracy of ResNet-32 with Batch Normalization for different batch sizes $m$ and base learning rates $\tilde{\eta}$.]

As summarized in the following figure, increasing the batch size progressively reduces the range of learning rates that provide stable convergence. This demonstrates how the increased variance in the weight update associated with larger batch sizes can affect the robustness and stability of training. The results clearly indicate that small batches are required both to achieve the best test performance and to allow easier, more robust optimization.

[Figure: range of base learning rates providing stable convergence, as a function of batch size.]

Different Batch Sizes for Weight Update and Batch Normalization

In the following figure, we consider the effect of using small sub-batches for Batch Normalization, and larger batches for SGD. This is common practice for the case of data-parallel distributed processing, where Batch Normalization is often implemented independently on each individual processor, while the gradients for the SGD weight updates are aggregated across all workers.

The CIFAR-100 results show a general performance improvement by reducing the overall batch size for the SGD weight updates. We note that the best test accuracy for a given overall SGD batch size is consistently obtained when even smaller batches are used for Batch Normalization. This evidence suggests that to achieve the best performance both a modest overall batch size for SGD and a small batch size for Batch Normalization are required.

[Figure: CIFAR-100 test accuracy for different combinations of SGD batch size and Batch Normalization batch size.]
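A sketch of how this splitting might be implemented (forward pass only, with hypothetical shapes and names of our own): the Batch Normalization statistics are computed over small sub-batches, as if each were processed on a separate worker, while the loss, and therefore the gradient for the SGD weight update, is still accumulated over the full batch.

```python
import numpy as np

rng = np.random.default_rng(5)

def sub_batch_norm(acts, sub_batch_size, eps=1e-5):
    """Normalize each sub-batch with its own mean and variance, as if it were
    processed on a separate worker, then re-assemble the full batch."""
    n, features = acts.shape
    acts = acts.reshape(n // sub_batch_size, sub_batch_size, features)
    mean = acts.mean(axis=1, keepdims=True)   # statistics per sub-batch of examples
    var = acts.var(axis=1, keepdims=True)
    return ((acts - mean) / np.sqrt(var + eps)).reshape(n, features)

acts = rng.normal(size=(64, 16))              # overall SGD batch of 64 examples
normed = sub_batch_norm(acts, sub_batch_size=8)

# The SGD weight update would still aggregate the gradients over all 64 examples;
# only the normalization statistics come from the smaller sub-batches of 8.
print(normed.shape)
```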

Why it matters

Using small batch sizes has been seen to achieve the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. The results also highlight the optimization difficulties associated with large batch sizes. Overall, the experimental results support the broad conclusion that using small batch sizes for training provides benefits both in terms of the range of learning rates that provide stable convergence and the test performance for a given number of epochs.

While we are not the first to conclude that smaller mini-batch sizes give better generalization performance, current practice is geared towards ever larger batch sizes, because today's hardware forces a trade-off: accepting less accurate results in exchange for enough synthesized parallelism to fill the wide vector data-paths of today's processors and to hide their long latencies to model data stored off-chip in DRAM.

With the arrival of new hardware specifically designed for machine intelligence, like Graphcore’s Intelligence Processing Unit (IPU), it’s time to rethink conventional wisdom on optimal batch size. With the IPU you will be able to run training efficiently even with small batches, and hence achieve both increased accuracy and faster training. In addition, because the IPU holds the entire model inside the processor, you gain an additional speed-up by virtue of not having to access external memory continuously. Our benchmark performance results highlight the faster training times that can be achieved.

You can read the full paper here: https://arxiv.org/abs/1804.07612.


