Adaptive Sampling

Shusen Wang, wssatzju@gmail.com

1. Preparations

We first prepare the data for our experiments.

Let us load a matrix from the file '../data/letters.txt'. The matrix ${\bf X}\in \mathbb{R}^{15,000\times 16}$ has $15,000$ samples and $16$ features. Let ${\bf X}_1$ contain the first $2,000$ rows and ${\bf X}_2$ contain the next $5,000$ rows. We then form the $2,000\times 5,000$ kernel matrix ${\bf A}$ using the rbf function.

The rbf function is defined in 'rbf.py'. We describe its interface in the following; a sketch of a possible implementation is given after the list.

  • Input:
    • ${\bf X}_1 \in \mathbb{R}^{n_1\times d}$ and ${\bf X}_2 \in \mathbb{R}^{n_2\times d}$: each row of which is a $d$-dimensional data sample;
    • $\sigma > 0$: the kernel width parameter.
  • Output:
    • ${\bf K} \in \mathbb{R}^{n_1\times n_2}$: the kernel matrix. Let ${\bf x}_1$ be the $i$-th row of ${\bf X}_1$ and ${\bf x}_2$ be the $j$-th row of ${\bf X}_2$. Then the $(i,j)$-th entry of ${\bf K}$ is $$k_{i j} = \exp \bigg( - \frac{ \|{\bf x}_1 - {\bf x}_2 \|_2^2 }{ 2 \sigma^2 } \bigg) .$$
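
Since 'rbf.py' itself is not reproduced in this notebook, the following is a minimal sketch of what the rbf function could look like, assuming only the interface described above; the actual implementation in 'rbf.py' may differ.

import numpy as np

def rbf(matrixX1, matrixX2, sigma):
    # pairwise squared Euclidean distances via ||x1||^2 + ||x2||^2 - 2 <x1, x2>
    sqNorm1 = np.sum(np.square(matrixX1), axis=1).reshape(-1, 1)
    sqNorm2 = np.sum(np.square(matrixX2), axis=1).reshape(1, -1)
    sqDist = sqNorm1 + sqNorm2 - 2 * np.dot(matrixX1, matrixX2.T)
    # guard against tiny negative values caused by floating-point error
    sqDist = np.maximum(sqDist, 0)
    return np.exp(-sqDist / (2 * sigma ** 2))
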
In [1]:
# prepare data
import numpy as np
from rbf import rbf

# load matrix X from file
filepath = '../data/letters.txt'
matrixX = np.loadtxt(filepath)

# form the RBF kernel matrix
matrixX1 = matrixX[0:2000, :]
matrixX2 = matrixX[2000:7000, :]
sigma = 0.7
matrixA = rbf(matrixX1, matrixX2, sigma)
print('The matrix A has ', matrixA.shape[0], ' rows and ', matrixA.shape[1], ' columns.')
The matrix A has  2000  rows and  5000  columns.

2. Algorithm Description

In short, adaptive sampling selects columns according to the residual of the previous round. The algorithm can be described as follows.

  1. Input: ${\bf A}\in \mathbb{R}^{m\times n}$, ${\bf C}_1 \in \mathbb{R}^{m\times c_1}$, $c_2$.
  2. Compute the residual ${\bf B} = {\bf A} - {\bf C}_1 {\bf C}_1^{\dagger} {\bf A}$;
  3. Compute the sampling probabilities: $p_j = \frac{\|{\bf b}_{:j}\|_2^2}{\|{\bf B}\|_F^2}$ for $j=1$ to $n$;
  4. Sample $c_2$ columns of ${\bf A}$ according to $p_1 , \cdots , p_n$ to form ${\bf C}_2 \in \mathbb{R}^{m\times c_2}$;
  5. Output: ${\bf C} = [ {\bf C}_1 , {\bf C}_2 ] \in \mathbb{R}^{m\times (c_1+c_2)}$, which is the horizontal stack of ${\bf C}_1$ and ${\bf C}_2$.

The following Python code implements the adaptive sampling algorithm.

In [2]:
import numpy as np

def adaptiveSampling(matrixA, matrixC1, c2):
    # compute the residual B = A - C1 * pinv(C1) * A via a thin QR of C1
    matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
    matrixQQA = np.dot(matrixQ, np.dot(matrixQ.transpose(), matrixA))
    matrixRes = matrixA - matrixQQA
    # compute the sampling probabilities p_j = ||b_:j||_2^2 / ||B||_F^2
    prob = np.sum(np.square(matrixRes), axis=0)
    prob = prob / np.sum(prob)
    # sample c2 distinct column indices of A according to p_1, ..., p_n
    return np.random.choice(matrixA.shape[1], c2, replace=False, p=prob)

c1 = 100
c2 = 200
# matrix C1 contains c1 uniformly sampled columns of matrix A
indexC1 = np.random.choice(matrixA.shape[1], c1, replace=False)
matrixC1 = matrixA[:, indexC1]
indexC2 = adaptiveSampling(matrixA, matrixC1, c2)
matrixC2 = matrixA[:, indexC2]
matrixC = np.concatenate((matrixC1, matrixC2), axis=1)

The following code computes the error ratio $\frac{ \| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \|_F }{ \|{\bf A}\|_F }$. Notice that ${\bf A} - {\bf C} {\bf C}^\dagger {\bf A}$ is the error incurred by projecting ${\bf A}$ onto the column space of ${\bf C}$.

In [3]:
matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
approxError = np.linalg.norm(matrixRes, 'fro') / np.linalg.norm(matrixA, 'fro')
print(approxError)
del matrixR, matrixQ, matrixRes, approxError
0.0391245213938

3. Theory

The theoretical properties of adaptive sampling have been analyzed in the following two papers:

Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2: 225-247, 2006.

Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14: 2729-2769, 2013.

Let $k \ll m, n$ be any fixed integer. Deshpande et al. showed that

$$\mathbb{E} \big\| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \big\|_F^2 \; \leq \; \min_{\textrm{rank} ({\bf M}) \leq k} \big\| {\bf A} - {\bf M} \big\|_F^2 + \frac{k}{c_2} \big\| {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A} \big\|_F^2 ,$$

where the expectation is taken w.r.t. the randomness in ${\bf C}_2$.
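For instance, with $k = 10$ and $c_2 = 200$ (the value used in the experiments below), the factor is $k / c_2 = 0.05$: in expectation, the squared projection error exceeds the best rank-$k$ approximation error by at most $5\%$ of the squared residual of the previous round.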

Let ${\bf R}$ be any fixed matrix of proper size and $\rho = \textrm{rank} ({\bf R})$. Wang & Zhang showed that

$$\mathbb{E} \big\| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} {\bf R}^\dagger {\bf R} \big\|_F^2 \; \leq \; \big\| {\bf A} - {\bf A} {\bf R}^\dagger {\bf R}\big\|_F^2 + \frac{\rho}{c_2} \big\| {\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A} \big\|_F^2 ,$$

where the expectation is also taken w.r.t. the randomness in ${\bf C}_2$.

4. Drawbacks of Adaptive Sampling

The adaptive sampling algorithm has a very strong error bound and good empirical performance, but it is computationally inefficient. Specifically, it has two major disadvantages.

  • It requires forming the residual matrix ${\bf A} - {\bf C}_1 {\bf C}_1^\dagger {\bf A}$, which costs $O (m n c_1)$ time.
  • Every entry of ${\bf A}$ must be visited. For kernel methods, it is prohibitive to form the whole kernel matrix when the number of data samples is large.
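
To make the first cost concrete: for the $2,000 \times 5,000$ matrix ${\bf A}$ above with $c_1 = 100$, computing the residual takes on the order of $m n c_1 = 10^9$ floating-point operations, and forming ${\bf A}$ in the first place already requires evaluating all $m n = 10^7$ kernel entries.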

5. The Incomplete Adaptive Sampling Algorithm

To overcome the two disadvantages of adaptive sampling, an efficient heuristic was proposed in the following paper.

Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation via column selection: theories, algorithms, and extensions. Journal of Machine Learning Research, 17(49): 1-49, 2016.

The basic idea is to first uniformly sample $o(n)$ columns and then down-sample to $c_2$ columns by adaptive sampling. The algorithm can be described as follows.

  1. Input: ${\bf A}\in \mathbb{R}^{m\times n}$, ${\bf C}_1 \in \mathbb{R}^{m\times c_1}$, $c_2$.
  2. Set an oversampling parameter $s > c_2$ and uniformly sample $s$ columns of ${\bf A}$ to form $\tilde{\bf A} \in \mathbb{R}^{m\times s}$;
  3. Compute the residual ${\bf B} = \tilde{\bf A} - {\bf C}_1 {\bf C}_1^{\dagger} \tilde{\bf A}$;
  4. Compute the sampling probabilities: $p_j = \frac{\|{\bf b}_{:j}\|_2^2}{\|{\bf B}\|_F^2}$ for $j=1$ to $s$;
  5. Sample $c_2$ columns of $\tilde{\bf A}$ according to $p_1 , \cdots , p_s$ to form ${\bf C}_2 \in \mathbb{R}^{m\times c_2}$;
  6. Output: ${\bf C} = [ {\bf C}_1 , {\bf C}_2 ] \in \mathbb{R}^{m\times (c_1+c_2)}$, which is the horizontal stack of ${\bf C}_1$ and ${\bf C}_2$.
In [4]:
def incompleteAdaptiveSampling(matrixA, matrixC1, c2):
    n = matrixA.shape[1]
    # oversampling (uniformly)
    s = min(n, 5 * c2) # the oversampling parameter can be tuned
    indexAtilde = np.random.choice(n, s, replace=False)
    matrixAtilde = matrixA[:, indexAtilde]
    # compute the residual
    matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
    matrixQQAtilde = np.dot(matrixQ, np.dot(matrixQ.transpose(), matrixAtilde))
    matrixRes = matrixAtilde - matrixQQAtilde
    # compute the sampling probabilities
    prob = np.sum(np.square(matrixRes), axis=0)
    prob = prob / np.sum(prob)
    # down-sample to c2 of the s candidate columns
    indexC2 = np.random.choice(s, c2, replace=False, p=prob)
    return indexAtilde[indexC2]
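
As a quick, illustrative check of the speed difference, the two routines can be timed on the matrices already in memory (matrixC1 and c2 are still defined from the cells above); absolute numbers depend on the machine and the underlying BLAS, so this is a sketch rather than a benchmark.

import time

start = time.time()
indexFull = adaptiveSampling(matrixA, matrixC1, c2)
print('adaptive sampling:           ', time.time() - start, 'seconds')

start = time.time()
indexFast = incompleteAdaptiveSampling(matrixA, matrixC1, c2)
print('incomplete adaptive sampling:', time.time() - start, 'seconds')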

6. Experiments

In the following we compare adaptive sampling with the incomplete adaptive sampling in terms of the error ratio $\frac{ \| {\bf A} - {\bf C} {\bf C}^\dagger {\bf A} \|_F }{ \|{\bf A}\|_F }$. We also employ uniform sampling for comparison.

In [5]:
c1 = 100
c2 = 200
c = c1 + c2
normA = np.linalg.norm(matrixA, 'fro')

# matrix C1 contains c1 uniformly sampled columns of matrix A
indexC1 = np.random.choice(matrixA.shape[1], c1, replace=False)
matrixC1 = matrixA[:, indexC1]

# repeat the experiments 10 times
repeat = 10
errorUniform = list()
errorIncomplete = list()
errorAdaptive = list()
for i in range(repeat):
    # uniform sampling
    indexC = np.random.choice(matrixA.shape[1], c, replace=False)
    matrixC = matrixA[:, indexC]
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    errorUniform.append(np.linalg.norm(matrixRes, 'fro') / normA)
    del indexC, matrixC, matrixQ, matrixR, matrixRes
    # incomplete adaptive sampling
    indexC2 = incompleteAdaptiveSampling(matrixA, matrixC1, c2)
    matrixC2 = matrixA[:, indexC2]
    matrixC = np.concatenate((matrixC1, matrixC2), axis=1)
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    errorIncomplete.append(np.linalg.norm(matrixRes, 'fro') / normA)
    del indexC2, matrixC2, matrixC, matrixQ, matrixR, matrixRes
    # adaptive sampling
    indexC2 = adaptiveSampling(matrixA, matrixC1, c2)
    matrixC2 = matrixA[:, indexC2]
    matrixC = np.concatenate((matrixC1, matrixC2), axis=1)
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    errorAdaptive.append(np.linalg.norm(matrixRes, 'fro') / normA)
    del indexC2, matrixC2, matrixC, matrixQ, matrixR, matrixRes

print('The error of the uniform sampling is ', np.mean(errorUniform))
print('The error of the incomplete adaptive sampling is ', np.mean(errorIncomplete))
print('The error of the adaptive sampling is ', np.mean(errorAdaptive))
The error of the uniform sampling is  0.0414388338643
The error of the incomplete adaptive sampling is  0.0386140974054
The error of the adaptive sampling is  0.0397159209344

The experiment shows that incomplete adaptive sampling performs as well as standard adaptive sampling, and that both outperform uniform sampling.
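
The comment in the code of Section 5 notes that the oversampling parameter can be tuned. One way to pick it is a small sweep over a few multiples of $c_2$; the sketch below uses a hypothetical variant incompleteAdaptiveSamplingS, which is not part of the original code and simply exposes $s$ as an argument.

def incompleteAdaptiveSamplingS(matrixA, matrixC1, c2, s):
    # variant of incompleteAdaptiveSampling with the oversampling
    # parameter s exposed as an argument (hypothetical, for tuning)
    n = matrixA.shape[1]
    s = min(n, s)
    indexAtilde = np.random.choice(n, s, replace=False)
    matrixAtilde = matrixA[:, indexAtilde]
    matrixQ, matrixR = np.linalg.qr(matrixC1, mode='reduced')
    matrixRes = matrixAtilde - np.dot(matrixQ, np.dot(matrixQ.T, matrixAtilde))
    prob = np.sum(np.square(matrixRes), axis=0)
    prob = prob / np.sum(prob)
    indexC2 = np.random.choice(s, c2, replace=False, p=prob)
    return indexAtilde[indexC2]

for s in [2 * c2, 5 * c2, 10 * c2]:
    indexC2 = incompleteAdaptiveSamplingS(matrixA, matrixC1, c2, s)
    matrixC = np.concatenate((matrixC1, matrixA[:, indexC2]), axis=1)
    matrixQ, matrixR = np.linalg.qr(matrixC, mode='reduced')
    matrixRes = matrixA - np.dot(matrixQ, np.dot(matrixQ.T, matrixA))
    print('s =', s, ': error ratio =', np.linalg.norm(matrixRes, 'fro') / normA)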
