Optimization is a subfield of mathematics / computer science which deals with finding the best solution. Typically, problems in optimization are stated like this:

where

- \(f(x): \mathbb{R}^n \to \mathbb{R}\) is the
**loss function (objective function)**to be minimized over the variable \(x\), - \(g_i(x) \leq 0\) are called
**inequality constraints**, and - \(h_i(x) = 0\) are called
**equality constraints**.

By convention, the standard form defines a **minimization problem**. A
**maximization problem** can be treated by negating the objective function.

(That was copied from en.wikipedia.org/w/Optimization_problem and only slightly edited.)

I'm now going to explain some very basic techniques which are used for finding good solutions to optimization problems. Please note that there are also discrete optimization problems where you have to finde a solution \(x \in \mathbb{N}^n\). I will only focus on continuous optimization problems.

## Simulated Annealing

Simulated Annealing is a heuristical optimization algorithm. It starts at a random point \(x \in \mathbb{R}^n\). Then it takes a random point \(y\) of the environment of \(x\):

If \(f(y) \leq f(x)\), then the current position \(x\) is overwritten with \(y\):

Otherwise, it might be overwritten with probability \(\exp \left (-\frac{f(y)-f(x)}{T(t)} \right )\) where \(T: \mathbb{N}_0 \rightarrow \mathbb{R}_{> 0}\) is called the temperature at time \(t\).

So the optimization algorithm is:

- Take a random point $x \in \mathbb{R}^n$
- Take a random point $y \in U(x)$
- $$x \leftarrow \begin{cases}y &\text{if } f(y) \leq f(x)\\ y &\text{if } \operatorname{rand}(0,1) < \exp \left (-\frac{f(y)-f(x)}{T(t)} \right )\\ x &\text{otherwise}\end{cases}$$
- Go to step 2.

See also my German description.

## Gradient descent

The gradient descent algorithm can easily be applied when the optimization problem has no constraints and the objective function \(f\) is differentiable. The idea is to just take a random starting point \(x \in \mathbb{R}^n\) and iteratively improve it. There are many algorithms which follow this approach (Simulated annealing, L-BFGS, Newton's method, Quasi-Newtonian, Conjugate Gradient, ...).

Instead of randomly going in other directions, the gradient \(\nabla f\) is calculated at the position \(x\). The gradient points in the direction of maximum increase, so we go in the opposite direction:

The problem with this approach is that the surface of the objective function
might first go down in the direction of \(\nabla f(x)\), but if you go a bit
further it can go up by a lot. So we want to make very small steps. To achive
this, we multiply the gradient with a factor \(\eta \in (0, 1]\). In machine
learining, this \(\eta\) is called the *learning rate* and typically one
chooses \(\eta = 0.01\). However, there are learning rate scheduling algorithms which adapt this parameter during training.

The update rule is:

### Iterative Descent

A more general formulation of the Gradient descent algorithm is called iterative descent. The idea is to start at some arbitrary \(x_0\) and iteratively update the current guess of the minimum to

where \(\eta \in (0, 1]\) is the step length (learning rate) and

is the direction of the descent. The direction depends on the Gradient \(\nabla f(x_k)\), but also on a matrix \(D_k\):

- $D_k = I$: Gradient descent
- $D_k = H_f^{-1}(x_k)$: Newtons method, where $H_f$ is the Hessian matrix of $f$

## Linear Regression with MSE

In linear regression one is given a list of \(n\) points \((x, y)\) with \(x \in \mathbb{R}^m\) and \(y \in \mathbb{R}\). The task is to find a matrix \(A \in \mathbb{1 \times m}\) such that the predicted value \(\hat{y}\) of the linear model

minimizes the term

For convenience, one can write the list of points as a matrix \(\mathbf{X} \in \mathbb{R}^{n \times m}\) and a vector \(\mathbf{y} \in \mathbb{R}^n\):

with the Euclidean norm

Every part of the sum is non-negative, so exponentiating the Euclidean norm with a positive factor will not change the result of the minimization:

Now we can see that this is every element of the vector squared. So we can get rid of the norm and then use distributivity:

You need to know that

You can get rid of the last transposing operation, because

This simplifies the optimization problem to

Now you can calculate the gradient of this term with respect to \(A\):

A necessary condition of a minimum is the gradient to be 0:

As \(H_{E_{X, y}} = \nabla^2 E_{X, y}(A) = 2 X^T X\) is positive definite, this is a minimum. Hence, the optimal solution to this problem is:

## Lagrange multipliers

Lagrange multipliers are a trick in optimization problems with constraints. They can be used to get rid of the constraints.

The Lagrange function has the form

with the *Lagrange multipliers* \(\lambda_j \in \mathbb{R}\) and \(h_j\) are equality constraints.

Necessary conditions for a minimum \(x^*\) is:

- $\nabla_x \mathcal{L} = \nabla_x f(x^*) + \sum_{j=1}^n \lambda_j \nabla_x h_j(x^*) \overset{!}{=} 0$
- $\frac{\partial}{\partial \lambda_j} \mathcal{L} = h_j(x^*) \overset{!}{=} 0, \quad j=1, \dots, $

See [Smi04] for many examples.

## Optimization Problem characteristics

There are some properties of optimization problems which make it easier / harder to solve:

Property | Easy | Hard |
---|---|---|

Objective | linear | non-linear |

Optimization Variable | small discrete, continuous | large discrete |

Constraints | No Constraints | Constraints |

## Resources

## References

- [Smi04] B. T. Smith, “Lagrange multipliers tutorial in the context of support vector machines,” Memorial University of Newfoundland St. John’s, Newfoundland, Canada, Jun. 2004.