Sunday, July 16, 2023

What are the various optimisation algorithms?

Momentum:

Momentum takes into account the history of gradients during optimization. It introduces a momentum term that accumulates the past gradients and uses them to guide the parameter updates. The momentum term helps to smooth out the variations in gradient updates and allows the optimizer to navigate through regions with high curvature more efficiently.


In the Momentum optimization algorithm, the parameter update at each iteration is a combination of the current gradient and the accumulated past gradients. The update is influenced by both the current gradient and the momentum term, which is a fraction of the previous parameter update.
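As a rough sketch of that update rule, the snippet below shows one Momentum step in NumPy. The function name momentum_step and the default values for lr and beta are illustrative assumptions, not any particular library's API.

import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    # velocity keeps a fraction (beta) of the previous update plus the new gradient step
    velocity = beta * velocity - lr * grads
    # the parameters move along the accumulated direction, not just the current gradient
    params = params + velocity
    return params, velocity

Calling this repeatedly with a running velocity is what smooths out oscillating gradients: updates that keep pointing the same way reinforce each other, while updates that flip sign partly cancel.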


RMSprop


RMSprop stands for "Root Mean Square Propagation." It is an adaptive learning rate optimization algorithm that adjusts the learning rate for each parameter based on the average of the squared gradients of that parameter. The main idea behind RMSprop is to divide the learning rate by the square root of the exponentially decaying average of squared gradients.


In RMSprop, the learning rate for each parameter is individually computed and updated during the optimization process. The learning rate is scaled by the square root of the moving average of the squared gradients. This scaling allows the algorithm to reduce the learning rate for parameters with large and frequent updates, while increasing the learning rate for parameters with small and infrequent updates.
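A minimal sketch of this per-parameter scaling, again assuming NumPy arrays; sq_avg holds the exponentially decaying average of squared gradients, and the defaults for lr, rho and eps are just common illustrative choices.

import numpy as np

def rmsprop_step(params, grads, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    # exponentially decaying average of squared gradients, one entry per parameter
    sq_avg = rho * sq_avg + (1.0 - rho) * grads ** 2
    # each parameter's step is divided by the root of its own squared-gradient average
    params = params - lr * grads / (np.sqrt(sq_avg) + eps)
    return params, sq_avg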




Adagrad:

Adagrad stands for "Adaptive Gradient Algorithm." It addresses the challenge of choosing an appropriate learning rate by automatically scaling the learning rates for each parameter based on the frequency and magnitude of their past gradients.


In Adagrad, the learning rate for each parameter is individually computed and updated during the optimization process. The learning rate is inversely proportional to the square root of the sum of squared gradients accumulated for that parameter. This means that parameters with smaller gradients receive larger learning rates, while parameters with larger gradients receive smaller learning rates.
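A comparable sketch for Adagrad, under the same NumPy assumptions. The only difference from the RMSprop snippet above is that the squared gradients are summed without decay, so each parameter's effective learning rate only ever shrinks.

import numpy as np

def adagrad_step(params, grads, grad_sq_sum, lr=0.01, eps=1e-8):
    # accumulate the full (undecayed) sum of squared gradients per parameter
    grad_sq_sum = grad_sq_sum + grads ** 2
    # parameters with a large accumulated sum get a small effective learning rate, and vice versa
    params = params - lr * grads / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum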




Nadam: 

Nadam stands for "Nesterov-accelerated Adaptive Moment Estimation." It combines the benefits of Nesterov Momentum and the adaptive learning rate scheme of Adam to improve the convergence speed and optimization performance.


In Nadam, the parameter updates are based on a combination of the current gradient and the momentum term. It incorporates the Nesterov Momentum technique, which calculates the gradient based on the lookahead position using the momentum term. This lookahead computation helps to make more accurate updates and improves convergence.
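A simplified sketch of one Nadam step, assuming NumPy and the usual Adam-style defaults. The exact form of the Nesterov correction varies slightly between implementations, so treat this as illustrative rather than definitive.

import numpy as np

def nadam_step(params, grads, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-style moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads ** 2
    # bias-corrected estimates (t is the 1-based iteration count)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov lookahead: blend the corrected momentum with the current gradient
    m_lookahead = beta1 * m_hat + (1 - beta1) * grads / (1 - beta1 ** t)
    params = params - lr * m_lookahead / (np.sqrt(v_hat) + eps)
    return params, m, v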


Gradient Descent:

Gradient Descent is a fundamental optimization algorithm commonly used in machine learning and deep learning. It is widely employed to minimize the loss or cost function during the training of models.


Gradient Descent is an iterative optimization algorithm that aims to find the minimum of a function by iteratively adjusting the model parameters in the direction of steepest descent. It utilizes the gradients of the function with respect to the parameters to guide the parameter updates.


The basic idea behind Gradient Descent is as follows:


1. Initialize the model parameters with some initial values.

2. Compute the gradients of the loss or cost function with respect to the parameters.

3. Update the parameters by taking steps proportional to the negative gradients.

4. Repeat steps 2 and 3 until convergence or a specified number of iterations.
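Putting those four steps into code, a minimal NumPy sketch might look like this; the quadratic loss with its minimum at [3, -1] is made up purely for illustration.

import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    # gradient of the illustrative quadratic loss sum((theta - target) ** 2)
    return 2.0 * (theta - target)

theta = np.zeros(2)            # step 1: initialize the parameters
lr = 0.1                       # step size (learning rate)
for _ in range(100):           # step 4: repeat for a fixed number of iterations
    g = grad(theta)            # step 2: compute the gradients
    theta = theta - lr * g     # step 3: step in the direction of the negative gradients

print(theta)                   # converges towards [3, -1]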


