Optimization is a subfield of mathematics and computer science that deals with finding the best solution to a problem. Typically, optimization problems are stated like this:
$$\begin{aligned}&\underset{x}{\text{minimize}}& &f(x)\\&\text{subject to}& &g_i(x) \leq 0, \quad i = 1, \dots, m\\&&&h_i(x) = 0, \quad i = 1, \dots, p\end{aligned}$$
where
- \(f(x): \mathbb{R}^n \to \mathbb{R}\) is the loss function (objective function) to be minimized over the variable \(x\),
- \(g_i(x) \leq 0\) are called inequality constraints, and
- \(h_i(x) = 0\) are called equality constraints.
By convention, the standard form defines a minimization problem. A maximization problem can be treated by negating the objective function.
(That was copied from en.wikipedia.org/w/Optimization_problem and only slightly edited.)
I'm now going to explain some very basic techniques which are used for finding good solutions to optimization problems. Please note that there are also discrete optimization problems, where you have to find a solution \(x \in \mathbb{N}^n\). I will focus only on continuous optimization problems.
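To make the standard form concrete, here is a minimal sketch using SciPy's `scipy.optimize.minimize` (the objective and constraints are made-up examples). Note that SciPy expects inequality constraints in the form \(g(x) \geq 0\), so the sign is flipped relative to the standard form above:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x) = (x_0 - 1)^2 + (x_1 - 2)^2
# subject to g(x) = x_0 + x_1 - 2 <= 0 and h(x) = x_0 - x_1 = 0.
f = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2

constraints = [
    # SciPy's "ineq" means fun(x) >= 0, so g(x) <= 0 becomes -g(x) >= 0.
    {"type": "ineq", "fun": lambda x: -(x[0] + x[1] - 2)},
    {"type": "eq", "fun": lambda x: x[0] - x[1]},
]

result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(result.x)  # -> approximately [1.0, 1.0]
```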
Simulated Annealing
Simulated Annealing is a heuristic optimization algorithm. It starts at a random point \(x \in \mathbb{R}^n\). Then it picks a random point \(y\) from a neighborhood \(U(x)\) of \(x\).
If \(f(y) \leq f(x)\), the current position \(x\) is overwritten with \(y\):
$$x \leftarrow y$$
Otherwise, it might be overwritten with probability \(\exp \left (-\frac{f(y)-f(x)}{T(t)} \right )\), where \(T: \mathbb{N}_0 \rightarrow \mathbb{R}_{> 0}\) is called the temperature at time \(t\). The temperature typically decreases over time, so uphill moves become less likely as the search progresses.
So the optimization algorithm is:

1. Take a random point $x \in \mathbb{R}^n$.
2. Take a random point $y \in U(x)$.
3. $$x \leftarrow \begin{cases}y &\text{if } f(y) \leq f(x)\\ y &\text{if } \operatorname{rand}(0,1) < \exp \left (-\frac{f(y)-f(x)}{T(t)} \right )\\ x &\text{otherwise}\end{cases}$$
4. Go to step 2.
See also my German description.
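Here is a minimal NumPy sketch of this loop. The neighborhood \(U(x)\) (a Gaussian step), the cooling schedule \(T(t) = \frac{1}{1+t}\), and the step count are illustrative choices, not part of the algorithm's definition:

```python
import numpy as np

def simulated_annealing(f, n, steps=10_000, seed=0):
    """Minimize f over R^n with an example neighborhood and cooling schedule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # step 1: random starting point
    for t in range(steps):
        y = x + rng.normal(scale=0.1, size=n)  # step 2: random y in U(x)
        T = 1.0 / (1 + t)  # example temperature schedule T(t)
        # step 3: always accept improvements; accept worse points
        # with probability exp(-(f(y) - f(x)) / T(t))
        if f(y) <= f(x) or rng.uniform() < np.exp(-(f(y) - f(x)) / T):
            x = y
    return x

# Example: a banana-shaped function with its minimum at (1, 1).
print(simulated_annealing(lambda x: (1 - x[0])**2 + (x[1] - x[0]**2)**2, n=2))
```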
Gradient descent
The gradient descent algorithm can easily be applied when the optimization problem has no constraints and the objective function \(f\) is differentiable. The idea is to take a random starting point \(x \in \mathbb{R}^n\) and improve it iteratively. There are many algorithms which follow this approach (simulated annealing, L-BFGS, Newton's method, quasi-Newton methods, conjugate gradient, ...).
Instead of randomly going in other directions, the gradient \(\nabla f\) is calculated at the position \(x\). The gradient points in the direction of maximum increase, so we go in the opposite direction \(-\nabla f(x)\).
The problem with this approach is that the surface of the objective function might first go down in the direction of \(-\nabla f(x)\), but go up again by a lot a bit further along. So we want to make very small steps. To achieve this, we multiply the gradient with a factor \(\eta \in (0, 1]\). In machine learning, this \(\eta\) is called the learning rate, and typically one chooses \(\eta = 0.01\). However, there are learning rate scheduling algorithms which adapt this parameter during training.
The update rule is:
$$x \leftarrow x - \eta \nabla f(x)$$
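A minimal sketch of this update rule, assuming the gradient is available analytically:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.01, steps=1000):
    """Repeatedly apply x <- x - eta * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = x^T x has the gradient 2x and its minimum at the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))  # -> close to [0, 0]
```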
Iterative Descent
A more general formulation of the gradient descent algorithm is called iterative descent. The idea is to start at some arbitrary \(x_0\) and iteratively update the current guess of the minimum to
$$x_{k+1} = x_k + \eta \cdot d_k$$
where \(\eta \in (0, 1]\) is the step length (learning rate) and
$$d_k = -D_k \cdot \nabla f(x_k)$$
is the direction of the descent. The direction depends on the gradient \(\nabla f(x_k)\), but also on a matrix \(D_k\):
- $D_k = I$: Gradient descent
- $D_k = H_f^{-1}(x_k)$: Newton's method, where $H_f$ is the Hessian matrix of $f$
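The following sketch contrasts the two choices of \(D_k\) on a simple quadratic (the test function, step length, and step count are made-up for illustration):

```python
import numpy as np

def iterative_descent(grad_f, D, x0, eta=1.0, steps=50):
    """x_{k+1} = x_k + eta * d_k with direction d_k = -D(x_k) @ grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * D(x) @ grad_f(x)
    return x

# Example: f(x) = 0.5 * x^T Q x with symmetric positive definite Q,
# so grad_f(x) = Q x and the Hessian H_f(x) = Q is constant.
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
grad_f = lambda x: Q @ x

x_gd = iterative_descent(grad_f, lambda x: np.eye(2), x0=[1.0, 1.0], eta=0.1)
x_newton = iterative_descent(grad_f, lambda x: np.linalg.inv(Q), x0=[1.0, 1.0])
print(x_gd, x_newton)  # Newton's method hits the minimum of a quadratic in one step
```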
Linear Regression with MSE
In linear regression, one is given a list of \(n\) points \((x, y)\) with \(x \in \mathbb{R}^m\) and \(y \in \mathbb{R}\). The task is to find a matrix \(A \in \mathbb{R}^{1 \times m}\) such that the predicted value \(\hat{y}\) of the linear model
$$\hat{y} = A x$$
minimizes the term
$$E = \sum_{i=1}^n (A x_i - y_i)^2$$
For convenience, one can write the list of points as a matrix \(\mathbf{X} \in \mathbb{R}^{n \times m}\) and a vector \(\mathbf{y} \in \mathbb{R}^n\):
$$\min_A \|\mathbf{X} A^T - \mathbf{y}\|_2$$
with the Euclidean norm
$$\|v\|_2 = \sqrt{\sum_{i=1}^n v_i^2}$$
Every part of the sum is non-negative, so squaring the Euclidean norm will not change the result of the minimization:
$$\min_A \|\mathbf{X} A^T - \mathbf{y}\|_2^2$$
Now we can see that this is just the sum of the squared elements of the vector. So we can get rid of the norm and then use distributivity:
$$\|\mathbf{X} A^T - \mathbf{y}\|_2^2 = (\mathbf{X} A^T - \mathbf{y})^T (\mathbf{X} A^T - \mathbf{y}) = (\mathbf{X} A^T)^T \mathbf{X} A^T - (\mathbf{X} A^T)^T \mathbf{y} - \mathbf{y}^T \mathbf{X} A^T + \mathbf{y}^T \mathbf{y}$$
You need to know that
$$(\mathbf{X} A^T)^T = A \mathbf{X}^T$$
You can get rid of the last transposing operation, because \(\mathbf{y}^T \mathbf{X} A^T\) is a scalar and a scalar equals its transpose:
$$\mathbf{y}^T \mathbf{X} A^T = (\mathbf{y}^T \mathbf{X} A^T)^T = A \mathbf{X}^T \mathbf{y}$$
This simplifies the optimization problem to
$$\min_A \left( A \mathbf{X}^T \mathbf{X} A^T - 2 A \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \right)$$
Now you can calculate the gradient of this term with respect to \(A\):
$$\nabla_A E_{X, y}(A) = 2 A \mathbf{X}^T \mathbf{X} - 2 \mathbf{y}^T \mathbf{X}$$
A necessary condition for a minimum is that the gradient is 0:
$$2 A \mathbf{X}^T \mathbf{X} - 2 \mathbf{y}^T \mathbf{X} \overset{!}{=} 0 \quad\Leftrightarrow\quad \mathbf{X}^T \mathbf{X} A^T = \mathbf{X}^T \mathbf{y}$$
As \(H_{E_{X, y}} = \nabla_A^2 E_{X, y}(A) = 2 \mathbf{X}^T \mathbf{X}\) is positive definite (assuming \(\mathbf{X}\) has full column rank), this is a minimum. Hence, the optimal solution to this problem is:
$$A^T = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
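A quick numerical check of this closed-form solution with NumPy (the data is synthetic; `np.linalg.solve` is used instead of forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))            # n points x in R^m, stacked as rows
A_true = np.array([[2.0, -1.0, 0.5]])  # ground-truth A in R^(1 x m)
y = (X @ A_true.T).ravel()             # noise-free targets for the check

# Normal equation: A^T = (X^T X)^{-1} X^T y, solved as a linear system.
A_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(A_hat)  # -> approximately [2.0, -1.0, 0.5]
```

For ill-conditioned \(\mathbf{X}\), `np.linalg.lstsq` is numerically preferable to solving the normal equation directly.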
Lagrange multipliers
Lagrange multipliers are a trick in optimization problems with constraints. They can be used to get rid of the constraints.
The Lagrange function has the form
$$\mathcal{L}(x, \lambda) = f(x) + \sum_{j=1}^n \lambda_j h_j(x)$$
with the Lagrange multipliers \(\lambda_j \in \mathbb{R}\), where the \(h_j\) are the equality constraints.
Necessary conditions for a minimum \(x^*\) are:
- $\nabla_x \mathcal{L} = \nabla_x f(x^*) + \sum_{j=1}^n \lambda_j \nabla_x h_j(x^*) \overset{!}{=} 0$
- $\frac{\partial}{\partial \lambda_j} \mathcal{L} = h_j(x^*) \overset{!}{=} 0, \quad j=1, \dots, n$
See [Smi04] for many examples.
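As a small worked example (the objective and constraint are made up), SymPy can solve these necessary conditions symbolically. Minimize \(f(x, y) = x^2 + y^2\) subject to \(h(x, y) = x + y - 1 = 0\):

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x**2 + y**2  # objective
h = x + y - 1    # equality constraint h(x, y) = 0
L = f + lam * h  # Lagrange function

# Necessary conditions: all partial derivatives of L must vanish.
solution = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam])
print(solution)  # -> {x: 1/2, y: 1/2, lambda: -1}
```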
Optimization Problem characteristics
There are some properties of optimization problems which make them easier or harder to solve:
| Property | Easy | Hard |
|---|---|---|
| Objective | linear | non-linear |
| Optimization variable | small discrete, continuous | large discrete |
| Constraints | no constraints | constraints |
References
- [Smi04] B. T. Smith, “Lagrange multipliers tutorial in the context of support vector machines,” Memorial University of Newfoundland, St. John’s, Newfoundland, Canada, Jun. 2004.