Visualizing the convergence of all three variants of gradient descent.
Batch gradient descent computes the gradient of the cost function w.r.t. the parameters for the entire dataset, so the parameters are updated only once per epoch.
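As a minimal sketch of this idea, assuming a mean-squared-error cost on a linear model with data matrix `X` and targets `y` (the function name and hyperparameters are illustrative, not from the original):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_epochs=100):
    """Batch gradient descent: one update per epoch, over the full dataset."""
    theta = np.zeros(X.shape[1])
    for epoch in range(n_epochs):
        # Gradient of the MSE cost computed over the ENTIRE dataset.
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad  # a single parameter update per epoch
    return theta
```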
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example. These frequent, high-variance updates cause the objective function to fluctuate heavily.
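A sketch of SGD under the same assumptions (single-example gradients on a linear least-squares cost; shuffling each epoch is a common choice, not mandated by the text):

```python
import numpy as np

def sgd(X, y, lr=0.01, n_epochs=100):
    """SGD: one parameter update per training example."""
    theta = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for epoch in range(n_epochs):
        for i in rng.permutation(len(y)):  # shuffle examples each epoch
            # Noisy gradient estimate from a single example (X[i], y[i]);
            # this is the source of the heavy fluctuation in the objective.
            grad = X[i] * (X[i] @ theta - y[i])
            theta -= lr * grad
    return theta
```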
Mini-batch gradient descent, finally, takes the best of both worlds and performs an update for every mini-batch of bs training examples.
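Continuing the same illustrative setup, a sketch of the mini-batch variant, where `bs` is the mini-batch size:

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, n_epochs=100, bs=32):
    """Mini-batch gradient descent: one update per mini-batch of bs examples."""
    theta = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for epoch in range(n_epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), bs):
            batch = idx[start:start + bs]
            # Gradient averaged over the mini-batch: lower variance than
            # single-example SGD, cheaper than a full pass over the dataset.
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= lr * grad
    return theta
```

Averaging over `bs` examples reduces the variance of each update relative to SGD while still updating far more often than once per epoch, which is why this variant sits between the other two in the convergence visualization.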