When you take the function
$$f(x, y) = 3x^2 + 3y^2 + 2xy$$
and run gradient descent from \(x_0 = (6, 6)\) with learning rate \(\eta = \frac{1}{2}\), it diverges.
Gradient descent
Gradient descent is an optimization method that starts at a point \(x_0\) and repeatedly applies the update rule
$$x_{k+1} = x_k + \eta d_k(x_k)$$
where \(\eta\) is the step length (learning rate) and \(d_k\) is the direction.
The direction is
$$d_k(x_k) = - \nabla f(x_k)$$
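As a sketch, the update rule is just a loop. This is a minimal illustration assuming the gradient is supplied as a callable; `grad_f`, `x0`, `eta`, and `steps` are my own hypothetical names, not from any library.

```python
# Minimal sketch of gradient descent; all names here are illustrative.
def gradient_descent(grad_f, x0, eta, steps):
    x = list(x0)
    for _ in range(steps):
        d = [-g for g in grad_f(x)]                  # d_k(x_k) = -grad f(x_k)
        x = [xi + eta * di for xi, di in zip(x, d)]  # x_{k+1} = x_k + eta * d_k(x_k)
    return x
```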
Example
The gradient of \(f\) is
$$\nabla f(x, y) = \begin{pmatrix}6x + 2y\\6y + 2x\end{pmatrix}$$
Starting at \(x_0 = (6, 6)\) with \(\eta = \frac{1}{2}\):
\begin{align}
x_0 &= (6, 6) & d_0(x_0) &= (-48, -48)\\
x_1 &= (-18, -18) & d_1(x_1) &= (144, 144)\\
x_2 &= (54, 54) & d_2(x_2) &= (-432, -432)\\
x_3 &= (-162, -162) & d_3(x_3) &= (1296, 1296)
\end{align}
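These iterates are easy to reproduce numerically. Here is a small self-contained sketch (variable names are my own):

```python
# Reproducing the iterates for f(x, y) = 3x^2 + 3y^2 + 2xy with eta = 1/2.
def grad_f(p):
    x, y = p
    return [6 * x + 2 * y, 6 * y + 2 * x]

x = [6.0, 6.0]
eta = 0.5
for k in range(4):
    d = [-g for g in grad_f(x)]   # d_k(x_k) = -grad f(x_k)
    print(f"x_{k} = {x}, d_{k} = {d}")
    x = [xi + eta * di for xi, di in zip(x, d)]
# Prints x_0 = [6.0, 6.0] with d_0 = [-48.0, -48.0],
# then x_1 = [-18.0, -18.0], x_2 = [54.0, 54.0], x_3 = [-162.0, -162.0].
```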
In general: every iterate stays on the diagonal \(x = y\), so write \(x_n = (a_n, a_n)\). On the diagonal \(\nabla f(a, a) = (8a, 8a)\), and the update becomes
\begin{align}
a_n &= a_{n-1} - \eta \cdot 8 a_{n-1}\\
a_n &= (1 - 8\eta)\, a_{n-1}
\end{align}
so with \(\eta = \frac{1}{2}\) each step multiplies the coordinates by \(-3\).
You can clearly see that the recurrence \(a_n = (1 - 8\eta)\, a_{n-1}\) diverges exactly when \(|1 - 8\eta| > 1\), i.e. for any learning rate \(\eta > \frac{1}{4}\); at \(\eta = \frac{1}{4}\) it oscillates between \((6, 6)\) and \((-6, -6)\) forever. For this example, the learning rate \(\eta = \frac{1}{8}\) would find the global minimum \((0, 0)\) in one step, and any \(0 < \eta < \frac{1}{4}\) converges to it.
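You can also check the threshold numerically with the one-dimensional recurrence. This is a quick sketch; the step count and print format are arbitrary choices of mine.

```python
# Iterating a_n = (1 - 8*eta) * a_{n-1} from a_0 = 6 for a few learning rates.
for eta in [1/8, 1/5, 1/4, 1/2]:
    a = 6.0
    for _ in range(50):
        a = (1 - 8 * eta) * a
    print(f"eta = {eta:.3f}: a_50 = {a:.3g}")
# eta = 0.125 hits 0 in one step, eta = 0.200 decays toward 0,
# eta = 0.250 oscillates with constant magnitude 6, eta = 0.500 blows up.
```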