Using SVMs with sklearn

Support Vector Machines (SVMs) is a group of powerful classifiers. In this article, I will give a short impression of how they work. I continue with an example how to use SVMs with sklearn.

SVM theory

SVMs can be described with 5 ideas in mind:

Linear, binary classifiers: If data is linearly separable, it can be separated by a hyperplane. There is one hyperplane which maximizes the distance to the next datapoints (support vectors). This hyperplane should be taken:

$$ \begin{aligned} \text{minimize}_{\mathbf{w}, b}\,&\frac{1}{2} \|\mathbf{w}\|^2\\ \text{s.t. }& \forall_{i=1}^m y_i \cdot \underbrace{(\langle \mathbf{w}, \mathbf{x}_i\rangle + b)}_{\text{Classification}} \geq 1 \end{aligned}$$
Slack variables: Even if the underlying process which generates the features for the two classes is linearly separable, noise can make the data not separable. The introduction of slack variables to relax the requirement of linear separability solves this problem. The trade-off between accepting some errors and a more complex model is weighted by a parameter $C \in \mathbb{R}_0^+$. The bigger $C$, the more errors are accepted. The new optimization problem is: $$ \begin{aligned} \text{minimize}_{\mathbf{w}, b}\,&\frac{1}{2} \|\mathbf{w}\|^2 + C \cdot \sum_{i=1}^m \xi_i\\ \text{s.t. }& \forall_{i=1}^m y_i \cdot (\langle \mathbf{w}, \mathbf{x}_i\rangle + b) \geq 1 - \xi_i \end{aligned}$$ Note that $0 \le \xi_i \le 1$ means that the data point is within the margin, whereas $\xi_i \ge 1$ means it is misclassified. An SVM with $C > 0$ is also called a soft-margin SVM.
Dual Problem: The primal problem is to find the normal vector $\mathbf{w}$ and the bias $b$. The dual problem is to express $\mathbf{w}$ as a linear combination of the training data $\mathbf{x}_i$: $$\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$$ where $y_i \in \{-1, 1\}$ represents the class of the training example and $\alpha_i$ are Lagrange multipliers. The usage of Lagrange multipliers is explained with some examples in [Smi04]. The usage of the Lagrange multipliers $\alpha_i$ changes the optimization problem depend on the $\alpha_i$ which are weights for the feature vectors. It turns out that most $\alpha_i$ will be zero. The non-zero weighted vectors are called support vectors. The optimization problem is now, according to [Bur98] (a great read; if you really want to understand it I can recommend it!): $$ \begin{aligned} \text{maximize}_{\alpha_i}\,& \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle\\ \text{s.t. } & \forall_{i=1}^m 0 \leq \alpha_i \leq C\\ \text{s.t. } & \sum_{i=1}^m \alpha_i y_i = 0 \end{aligned}$$
Kernel-Trick: Not every dataset is linearly separable. This problem is approached by transforming the feature vectors $\mathbf{x}$ with a non-linear mapping $\Phi$ into a higher dimensional (probably $\infty$-dimensional) space. As the feature vectors $\mathbf{x}$ are only used within scalar product $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$, it is not necessary to do the transformation. It is enough to do the calculation $$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$ This function $K$ is called a kernel. The idea of never explicitly transforming the vectors $\mathbf{x}_i$ to the higher dimensional space is called the kernel trick. Common kernels include the polynomial kernel $$K_P(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + r)^p$$ of degree $p$ and coefficient $r$, the Gaussian RBF kernel $$K_{\text{Gauss}}(\mathbf{x}_i, \mathbf{x}_j) = e^{\frac{-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2 \sigma^2}}$$ and the sigmoid kernel $$K_{\text{tanh}}(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \langle \mathbf{x}_i, \mathbf{x}_j \rangle - r)$$ where the parameter $\gamma$ determines how much influence single training examples have.
Multiple Classes: By using the one-vs-all or the one-vs-one strategy it is possible to get a classifying system which can distinguish many classes.

A nice visualization of the transformation of the data in a higher-dimensional space was done by

TeamGrizzly's channel: Performing nonlinear classification via linear separation in higher dimensional space on YouTube. 22.11.2010.

sklearn

sklearn is the machine learning toolkit to get started for Python. It has a very good documentation and many functions. You can find installation instructions on their website.

It also includes sklearn.svm.SVC. SVC is short for support vector classifier and this is how you use it for the MNIST dataset.

Parameters for which you might want a further explanation:

cache_size: datascience.stackexchange.com

#!/usr/bin/env python

"""
Train a SVM to categorize 28x28 pixel images into digits (MNIST dataset).
"""

import numpy as np


def main():
    """Orchestrate the retrival of data, training and testing."""
    data = get_data()

    # Get classifier
    from sklearn.svm import SVC

    clf = SVC(probability=False, kernel="rbf", C=2.8, gamma=0.0073)  # cache_size=200,

    print("Start fitting. This may take a while")

    # take all of it - make that number lower for experiments
    examples = len(data["train"]["X"])
    clf.fit(data["train"]["X"][:examples], data["train"]["y"][:examples])

    analyze(clf, data)


def analyze(clf, data):
    """
    Analyze how well a classifier performs on data.

    Parameters
    ----------
    clf : classifier object
    data : dict
    """
    # Get confusion matrix
    from sklearn import metrics

    predicted = clf.predict(data["test"]["X"])
    print(
        "Confusion matrix:\n%s" % metrics.confusion_matrix(data["test"]["y"], predicted)
    )
    print("Accuracy: %0.4f" % metrics.accuracy_score(data["test"]["y"], predicted))

    # Print example
    try_id = 1
    out = clf.predict(data["test"]["X"][try_id])  # clf.predict_proba
    print("out: %s" % out)
    size = int(len(data["test"]["X"][try_id]) ** (0.5))
    view_image(
        data["test"]["X"][try_id].reshape((size, size)), data["test"]["y"][try_id]
    )


def view_image(image, label=""):
    """
    View a single image.

    Parameters
    ----------
    image : numpy array
        Make sure this is of the shape you want.
    label : str
    """
    from matplotlib.pyplot import show, imshow, cm

    print("Label: %s" % label)
    imshow(image, cmap=cm.gray)
    show()


def get_data():
    """
    Get data ready to learn with.

    Returns
    -------
    dict
    """
    simple = False
    if simple:  # Load the simple, but similar digits dataset
        from sklearn.datasets import load_digits
        from sklearn.utils import shuffle

        digits = load_digits()
        x = [np.array(el).flatten() for el in digits.images]
        y = digits.target

        # Scale data to [-1, 1] - This is of mayor importance!!!
        # In this case, I know the range and thus I can (and should) scale
        # manually. However, this might not always be the case.
        # Then try sklearn.preprocessing.MinMaxScaler or
        # sklearn.preprocessing.StandardScaler
        x = x / 255.0 * 2 - 1

        x, y = shuffle(x, y, random_state=0)

        from sklearn.cross_validation import train_test_split

        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.33, random_state=42
        )
        data = {
            "train": {"X": x_train, "y": y_train},
            "test": {"X": x_test, "y": y_test},
        }
    else:  # Load the original dataset
        from sklearn.datasets import fetch_mldata
        from sklearn.utils import shuffle

        mnist = fetch_mldata("MNIST original")

        x = mnist.data
        y = mnist.target

        # Scale data to [-1, 1] - This is of mayor importance!!!
        x = x / 255.0 * 2 - 1

        x, y = shuffle(x, y, random_state=0)

        from sklearn.cross_validation import train_test_split

        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.33, random_state=42
        )
        data = {
            "train": {"X": x_train, "y": y_train},
            "test": {"X": x_test, "y": y_test},
        }
    return data


if __name__ == "__main__":
    main()

Results

The script from above gives the following results:

Confusion matrix for an SVM classifier on the MNIST dataset
	0	1	2	3	4	5	6	7	8	9
0	2258	1	4	1	2	2	3	1	4	2
1	1	2566	9	1	1	0	0	7	3	0
2	4	1	2280	5	4	0	1	9	8	2
3	0	0	14	2304	1	13	0	6	8	2
4	2	2	2	0	2183	0	7	5	0	10
5	4	0	0	16	3	2026	12	1	4	3
6	7	5	3	0	5	2	2245	0	4	0
7	1	6	11	2	5	1	0	2373	5	13
8	3	9	4	9	4	10	2	3	2166	5
9	3	2	2	6	19	6	0	12	10	2329

Accuracy: 98.40%
Error: 1.60%

Looks pretty good to me. However, note that there are much better results. The best on the official website has an error of 0.23% and is a committee of 35 convolutional neural networks.

The best SVM I could find has an error of 0.56% and applies a polynomial kernel of degree 9 as well as some preprocessing.

References

[Smi04] B. T. Smith, “Lagrange multipliers tutorial in the context of support vector machines,” Memorial University of Newfoundland St. John’s, Newfoundland, Canada, Jun. 2004.
[Bur98] C. J. Burges, “A tutorial on support vector machines for pattern recognition”, Data mining and knowledge discovery, vol. 2, no. 2, pp. 121–167, 1998.

Using SVMs with sklearn

SVM theory

sklearn

Results

References

See also

Published

Category

Tags

Contact