Shusen Wang, wssatzju@gmail.com

## 1. Preparations¶

We first prepare the data for our experiments.

Let us load a matrix from the file '../data/letters.txt'. The matix ${\bf X}\in \mathbb{R}^{15,000\times 16}$ has $15,000$ samples and $16$ features. Let ${\bf X}_1$ contain the first $2,000$ rows and ${\bf X}_2$ contain the next $5,000$ rows. We then form the $2,000\times 5,000$ kernel matrix ${\bf A}$ using the rbf function.

The rbf function is defined in 'rbf.py'. We describe it in the following.

• Input:
• ${\bf X}_1 \in \mathbb{R}^{n_1\times d}$ and ${\bf X}_2 \in \mathbb{R}^{n_2\times d}$: each row of which is $d$ dimensional data sample;
• $\sigma > 0$: the kernel width parameter.
• Output:
• ${\bf K} \in \mathbb{R}^{n_1\times n_2}$: the kernel matrix. Let ${\bf x}_1$ be the $i$-th row of ${\bf X}_1$ and ${\bf x}_2$ be the $j$-th row of ${\bf X}_2$. Then the $(i,j)$-th entry of ${\bf K}$ is $$k_{i j} = \exp \bigg( - \frac{ \|{\bf x}_1 - {\bf x}_2 \|_2^2 }{ 2 \sigma^2 } \bigg) .$$
In :
# prepare data
import numpy as np
from rbf import rbf

# load matrix X from file
filepath = '../data/letters.txt'

# form the RBF kernel matrix
matrixX1 = matrixX[0:2000, :]
matrixX2 = matrixX[2000:7000, :]
sigma = 0.7
matrixA = rbf(matrixX1, matrixX2, sigma)
print('The matrix A has ', matrixA.shape, ' rows and ', matrixA.shape, ' columns.')

The matrix A has  2000  rows and  5000  columns.


## 2. Algorithm Description¶

In a word, adaptive sampling is to select columns according to the residual of the previous round. Adaptive sampling can be described in the following.

1. Input: ${\bf A}\in \mathbb{R}^{m\times n}$, ${\bf C}_1 \in \mathbb{R}^{m\times c_1}$, $c_2$.
2. Compute the residual ${\bf B} = {\bf A} - {\bf C}_1 {\bf C}_1^{\dagger} {\bf A}$;
3. Compute the sampling probabilities: $p_j = \frac{\|{\bf b}_{:j}\|_2^2}{\|{\bf B}\|_F^2}$ for $j=1$ to $n$;
4. Sample $c_2$ columns of ${\bf A}$ according to $p_1 , \cdots , p_n$ to form ${\bf C}_2 \in \mathbb{R}^{m\times c_2}$;
5. Output: ${\bf C} = [ {\bf C}_1 , {\bf C}_2 ] \in \mathbb{R}^{m\times (c_1+c_2)}$, which the stack of ${\bf C}_1$ and ${\bf C}_2$.

The following Python code implements the adaptive sampling algorithm.

In :
import numpy as np

# compute the residual
matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
matrixQQA = np.dot(matrixQ, np.dot(matrixQ.transpose(), matrixA))
matrixRes = matrixA - matrixQQA
# compute the sampling probabilites
matrixRes = np.square(matrixRes)
prob = sum(matrixRes)
prob = prob / sum(prob)
return np.random.choice(matrixA.shape, c2, replace=False, p=prob)

c1 = 100;
c2 = 200;
# matrix C1 contains c1 uniformly sampled columns of matrix A
indexC1 = np.random.choice(matrixA.shape, c1, replace=False)
matrixC1 = matrixA[:, indexC1]
matrixC2 = matrixA[:, indexC2]
matrixC = np.concatenate((matrixC1, matrixC2), 1)


The following code computes the error ratio $\frac{ \| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \|_F }{ \|{\bf A}\|_F }$. Notice that ${\bf A} - {\bf C} {\bf C}^\dagger {\bf A}$ is the error incurred by projecting ${\bf A}$ to the column space of ${\bf C}$.

In :
matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
approxError = np.linalg.norm(matrixRes, 'fro') / np.linalg.norm(matrixA, 'fro')
print(approxError)
del(matrixR, matrixQ, matrixRes, approxError)

0.0391245213938


## 3. Theory¶

The theoretical properties of adaptive sampling has been analyzed in the following two papers:

Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(2006): 225-247, 2006.

Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nystrom approximation via adaptive sampling. Journal of Machine Learning Research, 14: 2729-2769, 2013.

Let $k \ll m, n$ be any fixed integer.Deshpande et al. showed that

$$\mathbb{E} \big\| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \big\|_F^2 \; \leq \; \min_{\textrm{rank} ({\bf M}) \leq k} \big\| {\bf A} - {\bf M} \big\|_F^2 + \frac{k}{c_2} \big\| {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A} \big\|_F^2 ,$$

where the expectation is taken w.r.t. the randomness in ${\bf C}_2$.

Let ${\bf R}$ be any fixed matrix of proper size and $\rho = \textrm{rank} ({\bf R})$. Wang & Zhang showed that

$$\mathbb{E} \big\| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} {\bf R}^\dagger {\bf R} \big\|_F^2 \; \leq \; \big\| {\bf A} - {\bf A} {\bf R}^\dagger {\bf R}\big\|_F^2 + \frac{\rho}{c_2} \big\| {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A} \big\|_F^2 ,$$

where the expectation is also taken w.r.t. the randomness in ${\bf C}_2$.

## 4. Drawbacks of Adaptive Sampling¶

The adaptive sampling algorithm has very strong error bound and good empirical performance, but it is computationally inefficient. Specifically, it has two major disadvantages.

• It requires forming the residual matrix ${\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A}$, which costs $O (m n c_1)$ time.
• Every entry of ${\bf A}$ must be visited. As for kernel methods, it is prohibitive to form the whole kernel matrix when the number of data sample is large.

## 5. The Incomplete Adaptive Sampling Algorithm¶

To overcome the two disadvantages of adaptive sampling, an efficient heuristic has been proposed by the following paper.

Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation vis column selection: theories, algorithms, and extensions. Journal of Machine Learning Research, 17(49):1−49, 2016. (JMLR 2016).

The basic idea is to first uniformly sample $o(n)$ columns and then down-sample to $c_2$ columns by adaptive sampling. The algorithm can be described as follows.

1. Input: ${\bf A}\in \mathbb{R}^{m\times n}$, ${\bf C}_1 \in \mathbb{R}^{m\times c_1}$, $c_2$.
2. Set an oversampling parameter $s > c_2$ and uniformly sample $s$ columns of ${\bf A}$ to form $\tilde{\bf A} \in \mathbb{R}^{m\times s}$;
3. Compute the residual ${\bf B} = \tilde{\bf A} - {\bf C}_1 {\bf C}_1^{\dagger} \tilde{\bf A}$;
4. Compute the sampling probabilities: $p_j = \frac{\|{\bf b}_{:j}\|_2^2}{\|{\bf B}\|_F^2}$ for $j=1$ to $s$;
5. Sample $c_2$ columns of $\tilde{\bf A}$ according to $p_1 , \cdots , p_s$ to form ${\bf C}_2 \in \mathbb{R}^{m\times c_2}$;
6. Output: ${\bf C} = [ {\bf C}_1 , {\bf C}_2 ] \in \mathbb{R}^{m\times (c_1+c_2)}$, which the stack of ${\bf C}_1$ and ${\bf C}_2$.
In :
def incompleteAdaptiveSampling(matrixA, matrixC1, c2):
n = matrixA.shape
# oversampling (uniformly)
s = min(n, 5 * c2) # the oversampling parameter can be tuned
indexAtilde = np.random.choice(n, s, replace=False)
matrixAtilde = matrixA[:, indexAtilde]
# compute the residual
matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
matrixQQAtilde = np.dot(matrixQ, np.dot(matrixQ.transpose(), matrixAtilde))
matrixRes = matrixAtilde - matrixQQAtilde
# compute the sampling probabilites
matrixRes = np.square(matrixRes)
prob = sum(matrixRes)
prob = prob / sum(prob)
indexC2 = np.random.choice(s, c2, replace=False, p=prob)
return indexAtilde[indexC2]


## 6. Experiments¶

In the following we compare adaptive sampling with the incomplete adaptive sampling in terms of the error ratio $\frac{ \| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \|_F }{ \|{\bf A}\|_F }$. We also employ uniform sampling for comparison.

In :
c1 = 100;
c2 = 200;
c = c1 + c2;
normA = np.linalg.norm(matrixA, 'fro')

# matrix C1 contains c1 uniformly sampled columns of matrix A
indexC1 = np.random.choice(matrixA.shape, c1, replace=False)
matrixC1 = matrixA[:, indexC1]

# repeat the experiments 10 times
repeat = 10
errorUniform = list()
errorIncomplete = list()
for i in range(repeat):
# uniform sampling
indexC = np.random.choice(matrixA.shape, c, replace=False)
matrixC = matrixA[:, indexC]
matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
errorUniform.append(np.linalg.norm(matrixRes, 'fro') / normA)
del(indexC, matrixC, matrixQ, matrixR, matrixRes)
matrixC2 = matrixA[:, indexC2]
matrixC = np.concatenate((matrixC1, matrixC2), 1)
matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
errorIncomplete.append(np.linalg.norm(matrixRes, 'fro') / normA)
del(indexC2, matrixC2, matrixC, matrixQ, matrixR, matrixRes)
matrixC2 = matrixA[:, indexC2]
matrixC = np.concatenate((matrixC1, matrixC2), 1)
matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
del(indexC2, matrixC2, matrixC, matrixQ, matrixR, matrixRes)

print('The error of the uniform sampling is ', np.mean(errorUniform))
print('The error of the incomplete adaptive sampling is ', np.mean(errorIncomplete))

The error of the uniform sampling is  0.0414388338643