Shusen Wang, wssatzju@gmail.com
We first prepare the data for our experiments.
Let us load a matrix from the file '../data/letters.txt'. The matrix ${\bf X}\in \mathbb{R}^{15,000\times 16}$ has $15,000$ samples and $16$ features. Let ${\bf X}_1$ contain the first $2,000$ rows and ${\bf X}_2$ contain the next $5,000$ rows. We then form the $2,000\times 5,000$ kernel matrix ${\bf A}$ using the rbf function.
The rbf function is defined in 'rbf.py'. We describe it in the following.
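The file 'rbf.py' is not reproduced here, so the following is a minimal sketch of what it plausibly contains, assuming the standard Gaussian kernel $k({\bf x}, {\bf y}) = \exp \big( - \| {\bf x} - {\bf y} \|_2^2 / (2 \sigma^2) \big)$; the actual implementation may use a different scaling convention.
import numpy as np

def rbf(matrixX1, matrixX2, sigma):
    # Gaussian RBF kernel matrix between the rows of X1 and X2
    # (a sketch; assumes k(x, y) = exp(-||x - y||^2 / (2 sigma^2)))
    sqNorm1 = np.sum(np.square(matrixX1), axis=1).reshape(-1, 1)  # n1-by-1
    sqNorm2 = np.sum(np.square(matrixX2), axis=1).reshape(1, -1)  # 1-by-n2
    # squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sqDist = sqNorm1 + sqNorm2 - 2 * np.dot(matrixX1, matrixX2.T)
    sqDist = np.maximum(sqDist, 0)  # guard against tiny negative values
    return np.exp(-sqDist / (2 * sigma ** 2))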
# prepare data
import numpy as np
from rbf import rbf
# load matrix X from file
filepath = '../data/letters.txt'
matrixX = np.loadtxt(filepath)
# form the RBF kernel matrix
matrixX1 = matrixX[0:2000, :]
matrixX2 = matrixX[2000:7000, :]
sigma = 0.7
matrixA = rbf(matrixX1, matrixX2, sigma)
print('The matrix A has ', matrixA.shape[0], ' rows and ', matrixA.shape[1], ' columns.')
In a nutshell, adaptive sampling selects columns according to the residual of the previous round: given an initial sketch ${\bf C}_1$, compute the residual ${\bf B} = {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A}$ and sample each column $j$ with probability proportional to $\| {\bf b}_j \|_2^2$, the squared norm of the $j$-th column of ${\bf B}$.
The following Python code implements the adaptive sampling algorithm.
import numpy as np
def adaptiveSampling(matrixA, matrixC1, c2):
    # compute the residual A - Q Q^T A, where C1 = Q R
    matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
    matrixQQA = np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    matrixRes = matrixA - matrixQQA
    # sampling probabilities proportional to the squared column norms of the residual
    matrixRes = np.square(matrixRes)
    prob = np.sum(matrixRes, axis=0)
    prob = prob / np.sum(prob)
    return np.random.choice(matrixA.shape[1], c2, replace=False, p=prob)
c1 = 100
c2 = 200
# matrix C1 contains c1 uniformly sampled columns of matrix A
indexC1 = np.random.choice(matrixA.shape[1], c1, replace=False)
matrixC1 = matrixA[:, indexC1]
# matrix C2 contains c2 adaptively sampled columns of matrix A
indexC2 = adaptiveSampling(matrixA, matrixC1, c2)
matrixC2 = matrixA[:, indexC2]
matrixC = np.concatenate((matrixC1, matrixC2), axis=1)
The following code computes the error ratio $\frac{ \| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \|_F }{ \|{\bf A}\|_F }$. Notice that ${\bf A} - {\bf C} {\bf C}^\dagger {\bf A}$ is the error incurred by projecting ${\bf A}$ onto the column space of ${\bf C}$. Since the reduced QR decomposition ${\bf C} = {\bf Q} {\bf R}$ gives ${\bf C} {\bf C}^\dagger = {\bf Q} {\bf Q}^T$ whenever ${\bf C}$ has full column rank, the projection can be computed from ${\bf Q}$ alone.
matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
approxError = np.linalg.norm(matrixRes, 'fro') / np.linalg.norm(matrixA, 'fro')
print(approxError)
del matrixR, matrixQ, matrixRes, approxError
The theoretical properties of adaptive sampling have been analyzed in the following two papers:
Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2:225-247, 2006.
Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729-2769, 2013.
Let ${\bf A}$ be $m\times n$ and $k \ll m, n$ be any fixed integer. Deshpande et al. showed that
$$\mathbb{E} \big\| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \big\|_F^2 \; \leq \; \min_{\textrm{rank} ({\bf M}) \leq k} \big\| {\bf A} - {\bf M} \big\|_F^2 + \frac{k}{c_2} \big\| {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A} \big\|_F^2 ,$$
where the expectation is taken w.r.t. the randomness in ${\bf C}_2$.
Let ${\bf R}$ be any fixed matrix of proper size and $\rho = \textrm{rank} ({\bf R})$. Wang & Zhang showed that
$$\mathbb{E} \big\| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} {\bf R}^\dagger {\bf R} \big\|_F^2 \; \leq \; \big\| {\bf A} - {\bf A} {\bf R}^\dagger {\bf R}\big\|_F^2 + \frac{\rho}{c_2} \big\| {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A} \big\|_F^2 ,$$
where the expectation is again taken w.r.t. the randomness in ${\bf C}_2$.
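As a quick sanity check of the first bound, we can estimate the left-hand side by averaging over repeated draws of ${\bf C}_2$ and compare it with the right-hand side computed via the SVD. This is only a sketch: the choices $k = 50$ and $20$ trials are arbitrary, and with finitely many trials the empirical average only approximates the expectation.
# sanity check of the first bound (a sketch; k = 50 and 20 trials are arbitrary choices)
k = 50
numTrials = 20

# right-hand side: best rank-k error plus the down-weighted residual of C1
singularValues = np.linalg.svd(matrixA, compute_uv=False)
bestRankKError = np.sum(np.square(singularValues[k:]))
matrixQ1, _ = np.linalg.qr(matrixC1, mode='reduced')
residualC1 = np.linalg.norm(matrixA - np.dot(matrixQ1, np.dot(matrixQ1.T, matrixA)), 'fro') ** 2
rhs = bestRankKError + (k / c2) * residualC1

# left-hand side: average squared projection error over draws of C2
lhs = 0.0
for _ in range(numTrials):
    indexC2 = adaptiveSampling(matrixA, matrixC1, c2)
    matrixC = np.concatenate((matrixC1, matrixA[:, indexC2]), axis=1)
    matrixQ, _ = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    lhs += np.linalg.norm(matrixRes, 'fro') ** 2 / numTrials

print('estimated lhs: ', lhs, ', rhs bound: ', rhs)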
The adaptive sampling algorithm has a very strong error bound and good empirical performance, but it is computationally inefficient. Specifically, it has two major disadvantages: first, forming the residual ${\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A}$ costs $O(m n c_1)$ time; second, every entry of ${\bf A}$ must be computed and kept in memory. (A timing comparison is given at the end of this section.)
To overcome these two disadvantages, an efficient heuristic was proposed in the following paper.
Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation via column selection: theories, algorithms, and extensions. Journal of Machine Learning Research, 17(49):1-49, 2016.
The basic idea is to first uniformly sample $o(n)$ columns and then down-sample to $c_2$ columns by adaptive sampling. The algorithm can be described as follows.
def incompleteAdaptiveSampling(matrixA, matrixC1, c2):
    # uniformly oversample s columns of A, then down-sample to c2 columns
    # by adaptive sampling restricted to those s columns
    n = matrixA.shape[1]
    # oversampling (uniformly)
    s = min(n, 5 * c2)  # the oversampling parameter can be tuned
    indexAtilde = np.random.choice(n, s, replace=False)
    matrixAtilde = matrixA[:, indexAtilde]
    # compute the residual of the sampled columns only
    matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
    matrixQQAtilde = np.dot(matrixQ, np.dot(matrixQ.T, matrixAtilde))
    matrixRes = matrixAtilde - matrixQQAtilde
    # compute the sampling probabilities
    matrixRes = np.square(matrixRes)
    prob = np.sum(matrixRes, axis=0)
    prob = prob / np.sum(prob)
    indexC2 = np.random.choice(s, c2, replace=False, p=prob)
    return indexAtilde[indexC2]
In the following, we compare adaptive sampling with incomplete adaptive sampling in terms of the error ratio $\frac{ \| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \|_F }{ \|{\bf A}\|_F }$; uniform sampling is included as a baseline.
c1 = 100
c2 = 200
c = c1 + c2
normA = np.linalg.norm(matrixA, 'fro')

# matrix C1 contains c1 uniformly sampled columns of matrix A
indexC1 = np.random.choice(matrixA.shape[1], c1, replace=False)
matrixC1 = matrixA[:, indexC1]

# repeat the experiments 10 times
repeat = 10
errorUniform = list()
errorIncomplete = list()
errorAdaptive = list()

for i in range(repeat):
    # uniform sampling
    indexC = np.random.choice(matrixA.shape[1], c, replace=False)
    matrixC = matrixA[:, indexC]
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    errorUniform.append(np.linalg.norm(matrixRes, 'fro') / normA)
    del indexC, matrixC, matrixQ, matrixR, matrixRes

    # incomplete adaptive sampling
    indexC2 = incompleteAdaptiveSampling(matrixA, matrixC1, c2)
    matrixC2 = matrixA[:, indexC2]
    matrixC = np.concatenate((matrixC1, matrixC2), axis=1)
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    errorIncomplete.append(np.linalg.norm(matrixRes, 'fro') / normA)
    del indexC2, matrixC2, matrixC, matrixQ, matrixR, matrixRes

    # adaptive sampling
    indexC2 = adaptiveSampling(matrixA, matrixC1, c2)
    matrixC2 = matrixA[:, indexC2]
    matrixC = np.concatenate((matrixC1, matrixC2), axis=1)
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    errorAdaptive.append(np.linalg.norm(matrixRes, 'fro') / normA)
    del indexC2, matrixC2, matrixC, matrixQ, matrixR, matrixRes

print('The mean error ratio of uniform sampling is ', np.mean(errorUniform))
print('The mean error ratio of incomplete adaptive sampling is ', np.mean(errorIncomplete))
print('The mean error ratio of adaptive sampling is ', np.mean(errorAdaptive))
The experiment shows that incomplete adaptive sampling is about as good as standard adaptive sampling, and that both are better than uniform sampling.
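Finally, to illustrate the efficiency claim, here is a quick timing sketch using Python's time module. On this relatively small $2,000 \times 5,000$ matrix the gap is modest, and the absolute numbers depend on the machine and the BLAS library in use.
import time

# time one call to each sampler (a rough illustration)
start = time.time()
adaptiveSampling(matrixA, matrixC1, c2)
print('adaptive sampling: ', time.time() - start, ' seconds')

start = time.time()
incompleteAdaptiveSampling(matrixA, matrixC1, c2)
print('incomplete adaptive sampling: ', time.time() - start, ' seconds')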