Whitening is a transformation of data such that the covariance matrix \(\Sigma\) of the transformed data is the identity matrix. Hence whitening decorrelates the features. It is used as a preprocessing step.
When you have \(N\) data points in \(\mathbb{R}^n\), then the covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\) is estimated to be:
\[\Sigma_{j,k} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{i,j} - \bar{x}_j)(x_{i,k} - \bar{x}_k)\]
where \(x_{i,j}\) denotes the \(j\)-th component of the \(i\)-th sample and \(\bar{x}_j\) denotes the \(j\)-th component of the estimated mean of the samples \(x\).
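To make the estimator concrete, here is a minimal NumPy sketch (the toy data and all variable names are purely illustrative):

import numpy as np

# Toy data: N = 1000 correlated samples in R^3 (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))

x_bar = x.mean(axis=0)                                  # estimated mean, one entry per feature
sigma = (x - x_bar).T @ (x - x_bar) / (x.shape[0] - 1)  # unbiased covariance estimate

# np.cov uses the same N - 1 normalization by default
assert np.allclose(sigma, np.cov(x, rowvar=False))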
Any matrix \(W \in \mathbb{R}^{n \times n}\) which satisfies the condition
\[W^T W = \Sigma^{-1}\]
whitens the data. ZCA whitening is the choice \(W = \Sigma^{-\frac{1}{2}}\). PCA whitening, \(W = \Lambda^{-\frac{1}{2}} U^T\) with the eigendecomposition \(\Sigma = U \Lambda U^T\), is another choice. According to "Neural Networks: Tricks of the Trade", PCA and ZCA whitening differ only by a rotation.
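To make the last two statements concrete, here is a small sketch (toy data, names chosen only for illustration) that builds both whitening matrices from the eigendecomposition and checks the condition above:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))  # correlated toy data
x = x - x.mean(axis=0)
sigma = x.T @ x / (x.shape[0] - 1)

# Eigendecomposition of the symmetric covariance matrix: sigma = U diag(lam) U^T
lam, U = np.linalg.eigh(sigma)

W_pca = np.diag(lam ** -0.5) @ U.T  # PCA whitening: rotate onto the eigenbasis, then rescale
W_zca = U @ W_pca                   # ZCA whitening: rotate back afterwards, i.e. sigma^(-1/2)

for W in (W_pca, W_zca):
    # Both satisfy W sigma W^T = I (equivalently W^T W = sigma^{-1})
    assert np.allclose(W @ sigma @ W.T, np.eye(3))

# The two whitening matrices differ only by the rotation U
assert np.allclose(W_zca, U @ W_pca)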
How to do it
When you look at the Keras code, you can see the following:
import numpy as np
from scipy import linalg  # imports needed to run this excerpt

# flat_x: the training data, flattened to shape (num_samples, num_features)
# Calculate principal components
sigma = np.dot(flat_x.T, flat_x) / flat_x.shape[0]
u, s, _ = linalg.svd(sigma)
principal_components = np.dot(np.dot(u, np.diag(1.0 / np.sqrt(s + 10e-7))), u.T)
# Apply ZCA whitening
whitex = np.dot(flat_x, principal_components)
So, at first you compute the covariance matrix \(\Sigma\) (assuming flat_x has already been centered to zero mean). I'm not quite sure, but I think they should divide by flat_x.shape[0] - 1 to get the unbiased estimator.
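Either way, the difference between the two normalizations is easy to inspect with a placeholder flat_x (purely illustrative here, since the Keras code that builds flat_x is not shown above):

import numpy as np

flat_x = np.random.default_rng(0).normal(size=(100, 5))  # placeholder for the flattened data
flat_x = flat_x - flat_x.mean(axis=0)                    # center, so the products below are covariances

sigma_biased = np.dot(flat_x.T, flat_x) / flat_x.shape[0]          # what the snippet above computes
sigma_unbiased = np.dot(flat_x.T, flat_x) / (flat_x.shape[0] - 1)  # unbiased sample covariance

# np.cov divides by N - 1 by default; bias=True switches to N
assert np.allclose(sigma_unbiased, np.cov(flat_x, rowvar=False))
assert np.allclose(sigma_biased, np.cov(flat_x, rowvar=False, bias=True))

For large \(N\) the two estimates differ only by the factor \(\frac{N}{N-1}\), so in practice this hardly matters.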
Then you apply a singular value decomposition to the estimated covariance matrix. The matrix \(u \in \mathbb{R}^{n \times n}\) is orthogonal and \(s \in \mathbb{R}^n\) contains the singular values of \(\Sigma\); note that linalg.svd returns them as a 1-D array, not as a diagonal matrix. Since \(\Sigma\) is symmetric and positive semi-definite, this SVD coincides with its eigendecomposition: the columns of \(u\) are eigenvectors of \(\Sigma\) and the entries of \(s\) are the corresponding eigenvalues.
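One can check these claims numerically (again with a placeholder flat_x, used only for illustration):

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
flat_x = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # placeholder, correlated data
flat_x = flat_x - flat_x.mean(axis=0)
sigma = np.dot(flat_x.T, flat_x) / flat_x.shape[0]

u, s, _ = linalg.svd(sigma)

assert s.ndim == 1                               # s is a 1-D vector of singular values
assert np.allclose(u.T @ u, np.eye(4))           # u is orthogonal
assert np.allclose(sigma, u @ np.diag(s) @ u.T)  # u diag(s) u^T reconstructs sigma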
Next, the principal components are calculated:
\[u \cdot \operatorname{diag}\left(\frac{1}{\sqrt{s + \varepsilon}}\right) \cdot u^T\]
where \(\varepsilon\) is the constant 10e-7 from the code (note that this Python literal equals \(10^{-6}\), not \(10^{-7}\)). Adding it to the singular values prevents division by zero when a singular value is (close to) zero.
Whitening is then simply the multiplication of the data with this matrix of principal components.
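Putting the pieces together (again with a placeholder flat_x, since the code that builds it is not part of the excerpt), one can verify that the whitened data indeed has a covariance matrix close to the identity:

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
flat_x = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # placeholder, correlated data
flat_x = flat_x - flat_x.mean(axis=0)                         # center so sigma is a covariance matrix

sigma = np.dot(flat_x.T, flat_x) / flat_x.shape[0]
u, s, _ = linalg.svd(sigma)
principal_components = np.dot(np.dot(u, np.diag(1.0 / np.sqrt(s + 10e-7))), u.T)
whitex = np.dot(flat_x, principal_components)

# Covariance of the whitened data: approximately the identity matrix
cov_white = np.dot(whitex.T, whitex) / whitex.shape[0]
print(np.round(cov_white, 3))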
See also
- Alex Krizhevsky and Geoffrey Hinton: Learning multiple layers of features from tiny images
- Agnan Kessy, Alex Lewandowski, Korbinian Strimmer: Optimal whitening and decorrelation