Support Vector Machines (SVMs) are a family of powerful classifiers. In this article, I give a short overview of how they work and then show how to use SVMs with sklearn.
SVM theory
SVMs can be described with five ideas in mind:
- Linear, binary classifiers: If data is linearly separable, it can be separated by a hyperplane. Among all separating hyperplanes there is one which maximizes the distance to the nearest data points (the support vectors). This hyperplane should be taken. Since the distance between the two margin hyperplanes $\langle \mathbf{w}, \mathbf{x} \rangle + b = \pm 1$ is $\frac{2}{\|\mathbf{w}\|}$, maximizing the margin is the same as solving:
$$ \begin{aligned} \text{minimize}_{\mathbf{w}, b}\,&\frac{1}{2} \|\mathbf{w}\|^2\\ \text{s.t. }& \forall_{i=1}^m y_i \cdot \underbrace{(\langle \mathbf{w}, \mathbf{x}_i\rangle + b)}_{\text{Classification}} \geq 1 \end{aligned}$$
- Slack variables: Even if the underlying process which generates the features for the two classes is linearly separable, noise can make the data non-separable. Introducing slack variables $\xi_i \geq 0$ to relax the requirement of linear separability solves this problem. The trade-off between accepting some errors and a more complex model is weighted by a parameter $C \in \mathbb{R}_0^+$: the smaller $C$, the more margin violations are tolerated, while a large $C$ penalizes them heavily (the first sketch after this list illustrates this). The new optimization problem is: $$ \begin{aligned} \text{minimize}_{\mathbf{w}, b}\,&\frac{1}{2} \|\mathbf{w}\|^2 + C \cdot \sum_{i=1}^m \xi_i\\ \text{s.t. }& \forall_{i=1}^m y_i \cdot (\langle \mathbf{w}, \mathbf{x}_i\rangle + b) \geq 1 - \xi_i \end{aligned}$$ Note that $0 < \xi_i \leq 1$ means that the data point is within the margin but still correctly classified, whereas $\xi_i > 1$ means it is misclassified. An SVM with $C > 0$ is also called a soft-margin SVM.
- Dual Problem: The primal problem is to find the normal vector $\mathbf{w}$ and the bias $b$. The dual problem is to express $\mathbf{w}$ as a linear combination of the training data $\mathbf{x}_i$: $$\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$$ where $y_i \in \{-1, 1\}$ denotes the class of the training example and the $\alpha_i$ are Lagrange multipliers. The usage of Lagrange multipliers is explained with some examples in [Smi04]. With the Lagrange multipliers $\alpha_i$, the optimization problem depends only on the $\alpha_i$, which act as weights for the feature vectors. It turns out that most $\alpha_i$ will be zero; the vectors with non-zero weight are called support vectors. The optimization problem is now, according to [Bur98] (a great read; if you really want to understand SVMs I can recommend it!): $$ \begin{aligned} \text{maximize}_{\alpha_i}\,& \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle\\ \text{s.t. } & \forall_{i=1}^m 0 \leq \alpha_i \leq C\\ \text{s.t. } & \sum_{i=1}^m \alpha_i y_i = 0 \end{aligned}$$ The second sketch after this list shows how to recover $\mathbf{w}$ from the $\alpha_i$ with sklearn.
- Kernel-Trick: Not every dataset is linearly separable. This problem is approached by transforming the feature vectors $\mathbf{x}$ with a non-linear mapping $\Phi$ into a higher dimensional (possibly $\infty$-dimensional) space. As the feature vectors $\mathbf{x}$ are only used within scalar products $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$, it is not necessary to compute the transformation explicitly. It is enough to evaluate $$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$$ This function $K$ is called a kernel. The idea of never explicitly transforming the vectors $\mathbf{x}_i$ to the higher dimensional space is called the kernel trick (the third sketch after this list verifies it numerically for a polynomial kernel). Common kernels include the polynomial kernel $$K_P(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + r)^p$$ of degree $p$ and coefficient $r$, the Gaussian RBF kernel $$K_{\text{Gauss}}(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2} \quad\text{with } \gamma = \frac{1}{2 \sigma^2}$$ and the sigmoid kernel $$K_{\text{tanh}}(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \langle \mathbf{x}_i, \mathbf{x}_j \rangle - r)$$ where the parameter $\gamma$ determines how much influence single training examples have.
- Multiple Classes: By using the one-vs-all (one-vs-rest) or the one-vs-one strategy, several binary SVMs can be combined into a classifier that distinguishes many classes (see the last sketch after this list).
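To make the effect of $C$ tangible, here is a minimal sketch on a synthetic 2D dataset (generated with sklearn's make_blobs; the data and the values of $C$ are my own choices for illustration). A point violates the margin exactly when $y_i \cdot (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) < 1$:

```python
#!/usr/bin/env python

"""Sketch: effect of the soft-margin parameter C on a noisy toy dataset."""

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs - the noise makes them not linearly separable.
x, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
y_signed = 2 * y - 1  # map the labels {0, 1} to {-1, +1}

for c in [0.01, 100.0]:
    clf = SVC(kernel="linear", C=c)
    clf.fit(x, y)
    # decision_function returns <w, x_i> + b; a margin violation is
    # y_i * (<w, x_i> + b) < 1, i.e. xi_i > 0 in the primal problem.
    margin = y_signed * clf.decision_function(x)
    print("C=%g: %i support vectors, %i margin violations"
          % (c, len(clf.support_vectors_), int(np.sum(margin < 1))))
```

A small $C$ buys a wide margin at the price of many violations; a large $C$ squeezes the number of violations down, which with non-linear kernels can easily turn into overfitting.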
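The dual view can also be checked directly with sklearn (again on a toy dataset of my own choosing): `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors, so for a linear kernel $\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$ can be recomputed by hand and compared to `coef_`:

```python
#!/usr/bin/env python

"""Sketch: recover w = sum_i alpha_i * y_i * x_i from the dual coefficients."""

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

x, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(x, y)

# dual_coef_ contains alpha_i * y_i, but only for the support vectors;
# all other alpha_i are zero and hence do not contribute to the sum.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("w from the dual coefficients: %s" % w_from_dual)
print("w from coef_:                 %s" % clf.coef_)
print("identical: %s" % np.allclose(w_from_dual, clf.coef_))
```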
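The kernel trick can be verified numerically. The following sketch (my own toy verification) expands the degree-2 polynomial kernel $(\langle \mathbf{x}, \mathbf{y} \rangle + r)^2$ for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$ into an explicit feature map $\Phi: \mathbb{R}^2 \to \mathbb{R}^6$ and checks that $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle$ gives exactly the same value as evaluating the kernel directly, without ever computing $\Phi$:

```python
#!/usr/bin/env python

"""Sketch: the degree-2 polynomial kernel is an inner product in R^6."""

import numpy as np


def phi(x, r=1.0):
    """Explicit feature map for K(x, y) = (<x, y> + r)^2 with x in R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * r) * x1,
                     np.sqrt(2 * r) * x2,
                     r])


def poly_kernel(x, y, r=1.0):
    """Evaluate the kernel directly in the original 2-dimensional space."""
    return (np.dot(x, y) + r) ** 2


x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print("K(x, y)          = %f" % poly_kernel(x, y))
print("<phi(x), phi(y)> = %f" % np.dot(phi(x), phi(y)))
```

For the Gaussian RBF kernel the corresponding feature space is infinite-dimensional, so there the trick is not just convenient but essential.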
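For multiple classes, sklearn's SVC already does this internally: it follows the one-vs-one strategy and trains $\frac{n(n-1)}{2}$ binary classifiers for $n$ classes. One-vs-rest can be obtained with the OneVsRestClassifier wrapper. A minimal sketch on the small digits dataset (the kernel parameters are arbitrary choices of mine):

```python
#!/usr/bin/env python

"""Sketch: multi-class classification built from binary SVMs."""

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

digits = load_digits()
x, y = digits.data / 16.0, digits.target  # 10 classes, pixel values in [0, 16]

# SVC uses one-vs-one internally: 10 * 9 / 2 = 45 binary classifiers.
ovo = SVC(kernel="rbf", gamma=0.02, C=10.0).fit(x, y)

# One-vs-rest: one binary classifier per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.02, C=10.0)).fit(x, y)

print("one-vs-one  training accuracy: %0.4f" % ovo.score(x, y))
print("one-vs-rest training accuracy: %0.4f" % ovr.score(x, y))
print("number of one-vs-rest estimators: %i" % len(ovr.estimators_))
```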
A nice visualization of the transformation of the data into a higher-dimensional space was done by TeamGrizzly's channel: Performing nonlinear classification via linear separation in higher dimensional space on YouTube, 22.11.2010.
See also:
- What is an example of a SVM kernel, where one implicitly uses an infinity-dimensional space?
- SVM - hard or soft margins?
sklearn
sklearn is the machine learning toolkit to get started with in Python. It has very good documentation and many functions. You can find installation instructions on their website. It also includes sklearn.svm.SVC.
SVC is short for support vector classifier and this is how you use it for
the MNIST dataset.
Parameters for which you might want further explanation:
- cache_size: see datascience.stackexchange.com
```python
#!/usr/bin/env python

"""Train an SVM to categorize 28x28 pixel images into digits (MNIST dataset)."""

import numpy as np


def main():
    """Orchestrate the retrieval of data, training and testing."""
    data = get_data()

    # Get classifier
    from sklearn.svm import SVC

    clf = SVC(probability=False, kernel="rbf", C=2.8, gamma=0.0073)  # cache_size=200,

    print("Start fitting. This may take a while")

    # take all of it - make that number lower for experiments
    examples = len(data["train"]["X"])
    clf.fit(data["train"]["X"][:examples], data["train"]["y"][:examples])

    analyze(clf, data)


def analyze(clf, data):
    """
    Analyze how well a classifier performs on data.

    Parameters
    ----------
    clf : classifier object
    data : dict
    """
    # Get confusion matrix
    from sklearn import metrics

    predicted = clf.predict(data["test"]["X"])
    print(
        "Confusion matrix:\n%s" % metrics.confusion_matrix(data["test"]["y"], predicted)
    )
    print("Accuracy: %0.4f" % metrics.accuracy_score(data["test"]["y"], predicted))

    # Print example
    try_id = 1
    # predict expects a 2D array, hence the reshape for a single example
    out = clf.predict(data["test"]["X"][try_id].reshape(1, -1))  # clf.predict_proba
    print("out: %s" % out)
    size = int(len(data["test"]["X"][try_id]) ** 0.5)
    view_image(
        data["test"]["X"][try_id].reshape((size, size)), data["test"]["y"][try_id]
    )


def view_image(image, label=""):
    """
    View a single image.

    Parameters
    ----------
    image : numpy array
        Make sure this is of the shape you want.
    label : str
    """
    from matplotlib.pyplot import show, imshow, cm

    print("Label: %s" % label)
    imshow(image, cmap=cm.gray)
    show()


def get_data():
    """
    Get data ready to learn with.

    Returns
    -------
    dict
    """
    simple = False
    if simple:  # Load the simple, but similar digits dataset
        from sklearn.datasets import load_digits
        from sklearn.utils import shuffle

        digits = load_digits()
        x = np.array([el.flatten() for el in digits.images], dtype=float)
        y = digits.target

        # Scale data to [-1, 1] - This is of major importance!!!
        # In this case, I know the range (0-16 for load_digits) and thus I can
        # (and should) scale manually. However, this might not always be the
        # case. Then try sklearn.preprocessing.MinMaxScaler or
        # sklearn.preprocessing.StandardScaler
        x = x / 16.0 * 2 - 1

        x, y = shuffle(x, y, random_state=0)

        from sklearn.model_selection import train_test_split

        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.33, random_state=42
        )
        data = {
            "train": {"X": x_train, "y": y_train},
            "test": {"X": x_test, "y": y_test},
        }
    else:  # Load the original dataset
        from sklearn.datasets import fetch_openml
        from sklearn.utils import shuffle

        # fetch_mldata("MNIST original") was removed from sklearn;
        # fetch_openml is its replacement
        mnist = fetch_openml("mnist_784", version=1, as_frame=False)
        x = mnist.data.astype(float)
        y = mnist.target.astype(int)

        # Scale data to [-1, 1] - This is of major importance!!!
        x = x / 255.0 * 2 - 1

        x, y = shuffle(x, y, random_state=0)

        from sklearn.model_selection import train_test_split

        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.33, random_state=42
        )
        data = {
            "train": {"X": x_train, "y": y_train},
            "test": {"X": x_test, "y": y_test},
        }
    return data


if __name__ == "__main__":
    main()
```
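The values C=2.8 and gamma=0.0073 above are hard-coded. If you want to find such values yourself, a coarse grid search with cross-validation on a subsample is a common approach. Here is a minimal sketch; it assumes the script above was saved as mnist_svm.py (a file name I made up) so that get_data can be imported, and the parameter grid is an arbitrary choice of mine:

```python
#!/usr/bin/env python

"""Sketch: tune C and gamma of the RBF-SVM with a grid search on a subsample."""

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from mnist_svm import get_data  # the script above, saved as mnist_svm.py

data = get_data()
# Use only a subsample - a full grid search on the whole training set takes very long.
x, y = data["train"]["X"][:5000], data["train"]["y"][:5000]

param_grid = {
    "C": [0.1, 1.0, 2.8, 10.0],
    "gamma": [0.001, 0.0073, 0.01],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
search.fit(x, y)

print("best parameters: %s" % search.best_params_)
print("best cross-validation accuracy: %0.4f" % search.best_score_)
```

Because the script above guards main() with `if __name__ == "__main__"`, importing get_data from it does not trigger a full training run.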
Results
The script from above gives the following results:
| true ↓ / predicted → | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2258 | 1 | 4 | 1 | 2 | 2 | 3 | 1 | 4 | 2 |
| 1 | 1 | 2566 | 9 | 1 | 1 | 0 | 0 | 7 | 3 | 0 |
| 2 | 4 | 1 | 2280 | 5 | 4 | 0 | 1 | 9 | 8 | 2 |
| 3 | 0 | 0 | 14 | 2304 | 1 | 13 | 0 | 6 | 8 | 2 |
| 4 | 2 | 2 | 2 | 0 | 2183 | 0 | 7 | 5 | 0 | 10 |
| 5 | 4 | 0 | 0 | 16 | 3 | 2026 | 12 | 1 | 4 | 3 |
| 6 | 7 | 5 | 3 | 0 | 5 | 2 | 2245 | 0 | 4 | 0 |
| 7 | 1 | 6 | 11 | 2 | 5 | 1 | 0 | 2373 | 5 | 13 |
| 8 | 3 | 9 | 4 | 9 | 4 | 10 | 2 | 3 | 2166 | 5 |
| 9 | 3 | 2 | 2 | 6 | 19 | 6 | 0 | 12 | 10 | 2329 |
- Accuracy: 98.40%
- Error: 1.60%
Looks pretty good to me. However, note that there are much better results. The best on the official website has an error of 0.23% and is a committee of 35 convolutional neural networks.
The best SVM I could find has an error of 0.56% and applies a polynomial kernel of degree 9 as well as some preprocessing.
References
- [Smi04] B. T. Smith, “Lagrange multipliers tutorial in the context of support vector machines,” Memorial University of Newfoundland St. John’s, Newfoundland, Canada, Jun. 2004.
- [Bur98] C. J. Burges, “A tutorial on support vector machines for pattern recognition”, Data mining and knowledge discovery, vol. 2, no. 2, pp. 121–167, 1998.