
Diverging Gradient Descent

Contents

  • Gradient descent
  • Example

When you take the function

$$f(x, y) = 3x^2 + 3y^2 + 2xy$$

and start gradient descent at $x_0 = (6, 6)$ with learning rate $\eta = \frac{1}{2}$, it diverges.

Gradient descent

Gradient descent is an optimization algorithm that starts at a point $x_0$ and then repeatedly applies the update rule

$$x_{k+1} = x_k + \eta d_k(x_k)$$

where $\eta$ is the step length (learning rate) and $d_k$ is the direction.

The direction is

$$d_k(x_k) = - \nabla f(x_k)$$
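
A minimal sketch of this update rule in Python might look as follows (the function name and interface are illustrative, not taken from any particular library):

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Iterate x_{k+1} = x_k + eta * d_k(x_k) with d_k(x_k) = -grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(steps):
        d = -grad(x)        # direction d_k(x_k) = -∇f(x_k)
        x = x + eta * d     # step of length eta along that direction
        history.append(x.copy())
    return history
```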

Example

$$\nabla f(x, y) = \begin{pmatrix}6x + 2y\\6y + 2x\end{pmatrix}$$

With $\eta = \frac{1}{2}$, the iterates and the scaled steps $\eta\, d_k(x_k)$ are

\begin{align}
x_0 &= (6, 6) & \eta\, d_0(x_0) &= (-24, -24)\\
x_1 &= (-18, -18) & \eta\, d_1(x_1) &= (72, 72)\\
x_2 &= (54, 54) & \eta\, d_2(x_2) &= (-216, -216)\\
x_3 &= (-162, -162) & \eta\, d_3(x_3) &= (648, 648)
\end{align}
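
These numbers can be reproduced with a few lines of Python (a sketch that hard-codes the gradient derived above):

```python
# Reproduce the iterates and the scaled steps eta * d_k(x_k) from the table above.
def grad_f(x, y):
    return 6 * x + 2 * y, 6 * y + 2 * x

eta = 0.5
x, y = 6.0, 6.0
for k in range(4):
    gx, gy = grad_f(x, y)
    step = (-eta * gx, -eta * gy)  # eta * d_k(x_k) = -eta * grad f(x_k)
    print(f"x_{k} = ({x:g}, {y:g}),   eta*d_{k} = ({step[0]:g}, {step[1]:g})")
    x, y = x + step[0], y + step[1]
```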

In general, every iterate lies on the diagonal $x = y$, so both coordinates stay equal; writing $x_{n-1}$ for that common value, $\nabla f(x_{n-1}, x_{n-1}) = (8 x_{n-1}, 8 x_{n-1})$ and

\begin{align}
x_n &= (x_{n-1} - 8 \cdot \tfrac{1}{2} \cdot x_{n-1},\; x_{n-1} - 8 \cdot \tfrac{1}{2} \cdot x_{n-1})\\
x_n &= (-3 x_{n-1}, -3 x_{n-1})
\end{align}

For a general learning rate, the same calculation gives $x_n = (1 - 8\eta)\, x_{n-1}$ in each coordinate. You can clearly see that any learning rate $\eta > \frac{1}{4}$ makes the iteration diverge. For this example, the learning rate $\eta = \frac{1}{8}$ finds the solution in a single step, and any $0 < \eta < \frac{1}{4}$ converges to the global optimum at $(0, 0)$.
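
A small sweep over learning rates from the same start point illustrates these regimes (a sketch assuming NumPy; the chosen rates are illustrative):

```python
# Sweep a few learning rates from (6, 6) and look at the iterate after 20 steps.
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([6 * x + 2 * y, 6 * y + 2 * x])

for eta in [0.05, 0.125, 0.2, 0.25, 0.3, 0.5]:
    p = np.array([6.0, 6.0])
    for _ in range(20):
        p = p - eta * grad_f(p)
    print(f"eta = {eta:5.3f}  ->  ||x_20|| = {np.linalg.norm(p):.3e}")
# eta = 0.125 reaches (0, 0) in a single step from (6, 6);
# rates below 1/4 shrink the iterates, rates above 1/4 blow them up.
```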

Published

Jul 21, 2016
by Martin Thoma

Category

Machine Learning

Tags

  • Gradient Descent
  • Optimization
