What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?

Gradient Descent and Stochastic Gradient Descent are optimization algorithms that search for the set of parameters that minimizes a loss function.
The difference is that in Gradient Descent, all training samples are evaluated to compute each parameter update, while in Stochastic Gradient Descent only one training sample (or a small mini-batch) is evaluated per update.
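
In update-rule form (the notation here is illustrative: θ denotes the parameters, η the learning rate, N the number of training samples, and L_i the loss on sample i), the two rules can be written as:

$$\text{GD:} \quad \theta \leftarrow \theta - \eta \, \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta L_i(\theta)$$

$$\text{SGD:} \quad \theta \leftarrow \theta - \eta \, \nabla_\theta L_{i_t}(\theta), \qquad i_t \text{ sampled uniformly at random}$$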

In a machine learning interview, the key differences between stochastic gradient descent (SGD) and gradient descent (GD) can be explained as follows:

  1. Update Rule:
    • Gradient Descent (GD): In GD, the parameters are updated after computing the gradient of the cost function (with respect to the parameters) over all samples in the training dataset. The average gradient is then used to update the parameters.
    • Stochastic Gradient Descent (SGD): In SGD, the parameters are updated after computing the gradient of the cost function on a single randomly chosen sample (or a small subset, known as a mini-batch) from the training dataset. A short code sketch contrasting the two update rules follows this list.
  2. Computational Complexity:
    • GD typically requires more computational resources since it computes gradients for all training samples in each iteration.
    • SGD is computationally less expensive since it only computes gradients for a single sample (or a mini-batch), making it faster, particularly for large datasets.
  3. Convergence:
    • GD typically needs fewer updates to reach a given accuracy, but each update requires a full pass over the data, so in wall-clock time it often converges more slowly than SGD on large datasets.
    • SGD converges faster initially but may oscillate near the minimum, so its convergence is not as smooth as that of GD. However, techniques like learning rate annealing and momentum can help stabilize the convergence of SGD.
  4. Stability:
    • GD tends to have a more stable convergence trajectory because it considers all data points.
    • SGD’s convergence trajectory can be noisier due to the stochastic nature of selecting a single sample (or a small subset) in each iteration.
  5. Generalization:
    • SGD may offer better generalization, especially when the training dataset is large, as it is less likely to get stuck in local minima and can escape saddle points more effectively due to its stochastic nature.
    • GD lacks this stochasticity: without the gradient noise of SGD, which often acts as an implicit regularizer, GD can settle into sharper minima that generalize less well, which is sometimes described as being more prone to overfitting.
  6. Memory Usage:
    • GD requires more memory since the gradient computation involves the entire training dataset for every single update.
    • SGD requires less memory since it processes one sample (or mini-batch) at a time.
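
To make the update-rule difference from item 1 concrete, here is a minimal NumPy sketch; the toy squared-error loss, function names, and learning rates are illustrative assumptions rather than part of the standard definitions:

```python
import numpy as np

def full_batch_gd(X, y, lr=0.1, epochs=100):
    """Gradient descent: one parameter update per pass, using the gradient averaged over all samples."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # average gradient of 1/2 * squared error over the whole dataset
        w -= lr * grad                     # one update per full pass through the data
    return w


def sgd(X, y, lr=0.01, epochs=100, seed=0):
    """Stochastic gradient descent: one parameter update per randomly chosen sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):    # shuffle, then visit samples one at a time
            grad = X[i] * (X[i] @ w - y[i])  # gradient of 1/2 * squared error for a single sample
            w -= lr * grad                   # cheap but noisy update
    return w
```

For example, on synthetic data such as X = rng.normal(size=(1000, 3)) and y = X @ np.array([1.0, -2.0, 0.5]), both routines recover weights close to [1.0, -2.0, 0.5]; SGD typically gets close after far fewer passes over the data, but with noisier intermediate values.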

In summary, while GD provides a more deterministic and stable optimization process, it can be computationally expensive and less suitable for large datasets. On the other hand, SGD is faster, more scalable, and can offer better generalization, but may require more hyperparameter tuning and exhibits noisier convergence behavior. Additionally, variations like mini-batch gradient descent strike a balance between the two by updating on small batches of samples, combining much of the stability of full-batch GD with the efficiency of SGD.
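
As a rough illustration of that mini-batch compromise (the batch size, learning rate, and function name are assumptions for this sketch), each update averages the gradient over a small random batch:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=100, batch_size=32, seed=0):
    """Mini-batch gradient descent: each update averages the gradient over a small random batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))          # reshuffle the data every epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)  # average gradient over the batch
            w -= lr * grad
    return w
```

Setting batch_size equal to the dataset size recovers full-batch GD, while batch_size=1 reduces to plain SGD, which is why mini-batch gradient descent is usually described as the practical middle ground.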