Chapter 9 Uncertainty Quantification
Imagine there is an unknown quantity \(u\) that we can’t observe directly. However, we may obtain a collection of observations \(Y = (y_1, \cdots, y_p) \in \mathcal R^p\). Then the Forward Problem is \(\underset{\text{unknown}}{u \in Z} \to \underset{\text{observed}}{Y}\).
In a sense, when we talk about science, we are talking about using \(Y\) to learn about \(u\). In practice, there are at least two “sources” of uncertainty associated with the picture \(u \to Y\):
\(u\) is unknown.
Observations are usually contaminated by some sort of noise, so that in general \(Y\) is not simply a function of the unknown \(u\). In statistics, we may model the “likelihood” or “probability” of observing \(Y\) given a certain value of \(u\). This quantifies our uncertainty about the possible values of \(Y\) given the unknown \(u\), either through a probability mass function \(P(Y = y | u)\) or through a density function \(P(Y \in C | u) = \int_C p(y | u) dy\).
Example 9.1 Let \(u \in (0,1)\) be the probability that the outcome of a certain biased coin flip is H. Suppose that this coin is flipped \(p\) times and we get \(y_i = \begin{cases}1 & \text{if the } i\text{-th flip is H}\\0 & \text{if the } i\text{-th flip is T}\end{cases}\).
In this case, the likelihood is
\[ P(Y = y | u) = u^{\sum_{i=1}^p y_i} (1-u)^{p - \sum_{i=1}^p y_i}, ~y_i \in \{0,1\} \]
Example 9.2
Regression: \(u: \mathcal X \to \mathcal R\) (Hidden regression function). \(y_i = u(x_i) + \varepsilon_i\), \(\varepsilon_i \sim N(0, \sigma^2)\).
\[ p(y|u) = \frac{1}{(\sqrt{2 \pi \sigma^2})^p} exp\Big(-\frac{||y - (u(x_1), \cdots, u(x_p))||^2}{2\sigma^2}\Big) \]
Classification with logistic model: Given \(u: \mathcal X \to \mathcal R\), define \(q_i = \frac{exp(u(x_i))}{1+exp(u(x_i))}\) for \(x_1, \cdots, x_p\). Then, \(y_i = \begin{cases}1 &\text{with prob } q_i\\0 & \text{with prob } 1-q_i\end{cases}\). Thus,
\[ P(y | u) = \prod_{i=1}^p q_i^{y_i} (1- q_i)^{1-y_i} = \prod_{i=1}^p \frac{exp(y_iu(x_i))}{1+exp(u(x_i))} \]
Example 9.3 (Density Estimation) Let \(\mathcal X = \mathcal R^d\) and suppose that \(u: \mathcal X \to \mathcal R\) is an unknown density function (\(u(x) \geq 0\), \(\forall x\), and \(\int_{\mathcal R^d} u(x) dx = 1\)), \(y_1, \cdots, y_p \underset{i.i.d}{\sim} u\). Thus, the likelihood of \(Y=(y_1, \cdots, y_p)\) is:
\[
p(y|u) = u(y_1) \cdots u(y_p)
\]
So far we have talked about the distribution of \(Y\) given \(u\) (Forward Problem). This models the uncertainty of our “measurement” if we had access to \(u\). However, how do we learn \(u\) from \(Y\) (Inverse Problem)? Moreover, how do we quantify the uncertainty of the unknown \(u\) before observing \(Y\), and after observing \(Y\)?
9.1 Bayes Perspective
Bayes Rule: \(p(u | y) \propto p(y | u) \pi(u)\). For the moment, we consider the case where \(u \in \mathcal R^d\). The prior \(\pi(u)\) is then interpreted as a density in \(\mathcal R^d\), and the posterior \(p(u | y)\) is usually also a density in \(\mathcal R^d\).
Example 9.4 (Example 9.1 Continued) Let \(u \in (0,1)\), the likelihood is \(Bernoulli(u):\)
\[ P(Y = y | u) = u^{\sum_{i=1}^p y_i} (1-u)^{p - \sum_{i=1}^p y_i}, ~y_i \in \{0,1\}, \]
the prior is \(Beta(\alpha, \beta)\):
\[ \pi(u) = \frac{u^{\alpha-1}(1-u)^{\beta-1}}{B(\alpha, \beta)}, ~\alpha, \beta >0, \]
where \(B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}\) is the Beta function.
Then the posterior is
\[ p(u | y) \propto p(y |u) \pi(u) \propto u^{\sum_{i=1}^p y_i + \alpha - 1} (1-u)^{p - \sum_{i=1}^p y_i + \beta - 1}, \]
that is, given \(Y=y\), \(u\) is \(Beta(\sum_{i=1}^p y_i + \alpha, ~p - \sum_{i=1}^p y_i + \beta)\).
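To make the conjugate update concrete, here is a minimal numerical sketch of this example; the true bias, the prior parameters, and the number of flips are illustrative choices, not values from the text.

```python
import numpy as np

# A minimal sketch of the Beta-Bernoulli update above. The true bias u_true,
# the prior parameters alpha, beta, and the number of flips p are illustrative.
rng = np.random.default_rng(0)
u_true, p = 0.7, 50
y = rng.binomial(1, u_true, size=p)      # p coin flips, y_i in {0, 1}

alpha, beta = 2.0, 2.0                   # Beta(alpha, beta) prior on u
alpha_post = alpha + y.sum()             # posterior: Beta(alpha + sum y_i,
beta_post = beta + p - y.sum()           #                 beta + p - sum y_i)

post_mean = alpha_post / (alpha_post + beta_post)
print(f"posterior: Beta({alpha_post:.0f}, {beta_post:.0f}), mean = {post_mean:.3f}")

# Posterior draws can be used for Monte Carlo estimates such as P(u > 0.5 | y).
samples = rng.beta(alpha_post, beta_post, size=10_000)
print("P(u > 0.5 | y) ≈", (samples > 0.5).mean())
```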
Example 9.5 (Example 9.2 Continued) Let \(u \in \mathcal R^d\) and \(x_1, \cdots, x_p \in \mathcal R^d\), and take the linear model \(u(x_i) = <u, x_i>\). The likelihood is:
\[ p(y|u) = \frac{1}{(\sqrt{2 \pi \sigma^2})^p} exp\Big(-\frac{||y - (<u, x_1>, \cdots, <u, x_p>)||^2}{2\sigma^2}\Big), \]
the prior is
\[ \pi(u) = \frac{1}{(\sqrt{2 \pi \lambda^{-1}})^d} exp(-\frac{\lambda}{2} ||u||^2_{\mathcal R^d}), ~i.e. ~ u \sim N(\vec 0, \frac{1}{\lambda} I). \]
Let \(X = (x_1, \cdots, x_p)^T \in \mathcal R^{p \times d}\) be the matrix whose rows are the \(x_i\). Then the posterior is
\[ \begin{aligned} p(u | y) &\propto p(y |u) \pi(u) \\ &\propto exp \Big(<\frac{X^Ty}{\sigma^2}, u> - \frac{1}{2} <(\lambda I + \frac{X^TX}{\sigma^2})u, u>\Big), \end{aligned} \]
which is (up to a constant) the density of \(N(\mu, \Sigma)\),
where \(\mu = \frac{1}{\sigma^2} (\lambda I + \frac{1}{\sigma^2} (X^TX))^{-1} X^Ty\) and \(\Sigma= (\lambda I + \frac{1}{\sigma^2} (X^TX))^{-1}\).
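A minimal sketch of these posterior formulas in code, assuming simulated data; the sizes \(p, d\), the noise level \(\sigma\), and the prior precision \(\lambda\) are illustrative choices.

```python
import numpy as np

# A minimal sketch of Example 9.5 (Bayesian linear regression); the data sizes,
# noise level sigma, and prior precision lam are illustrative choices.
rng = np.random.default_rng(1)
p, d = 30, 3
sigma, lam = 0.5, 1.0

u_true = rng.normal(size=d)
X = rng.normal(size=(p, d))                   # rows are x_1, ..., x_p
y = X @ u_true + sigma * rng.normal(size=p)   # y_i = <u, x_i> + eps_i

# Posterior N(mu, Sigma) with Sigma = (lam*I + X^T X / sigma^2)^{-1}
# and mu = Sigma @ X^T y / sigma^2.
Sigma_post = np.linalg.inv(lam * np.eye(d) + X.T @ X / sigma**2)
mu_post = Sigma_post @ X.T @ y / sigma**2

print("posterior mean   :", mu_post)
print("posterior st.dev.:", np.sqrt(np.diag(Sigma_post)))
```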
With the posterior, we can then compute the following:
1. Posterior Mean:
\[ \begin{aligned} \hat u_{PM} = E[u_i | y] &= \frac{\int_{\mathcal R^d} u_i p(y | u) \pi(u) du_1 \cdots du_d}{\int_{\mathcal R^d} p(y | u) \pi(u) du_1 \cdots du_d}\\ &= \frac{\int_{\mathcal R^d} u_i p(y | u) \pi(u) du_1 \cdots du_d}{Z(y)} \end{aligned} \]
where \(Z(y)\) is the normalization constant.
Then, based on the observations (and model), we estimate the i-th coordinate of the unknown as \(E[u_i | y]\).
We could also estimate \(u_i^2 cos(u_j)\) with: \(E[u_i^2 cos(u_j) | y] = \frac{\int_{\mathcal R^d} u_i^2cos(u_j) p(y | u) \pi(u) du_1 \cdots du_d}{Z(y)}\)
2. Posterior Covariance: \(Cov(u_i, u_j | y)\) or \(Var(u_i|y) = E[u_i^2 | y] - (E[u_i|y])^2\). A measure of how certain we are about the estimation of \(u_i\).
3. MAP (Maximum a posteriori):
\[ \begin{aligned} u^* &= \underset{u}{\operatorname{arg max}} \pi(u) p(y | u)\\ &= \underset{u}{\operatorname{arg max}} log(\pi(u)) + log(p(y | u))\\ &= \underset{u}{\operatorname{arg min}} -log(\pi(u)) - log(p(y | u)) \end{aligned} \]
Both 1 and 2 are based on being able to take expectations w.r.t. the posterior. Item 3, on the other hand, is not an expectation but an optimization problem.
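As an illustration of the optimization viewpoint, here is a minimal sketch that computes a MAP estimate for the logistic model of Example 9.2 with a Gaussian prior; the simulated data, the prior precision \(\lambda\), and the use of scipy's BFGS routine are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# A minimal sketch of computing a MAP estimate by optimization, using the
# logistic model of Example 9.2 with a Gaussian prior u ~ N(0, I/lam).
rng = np.random.default_rng(2)
p, d, lam = 100, 3, 1.0
X = rng.normal(size=(p, d))                       # rows are x_1, ..., x_p
u_true = np.array([1.0, -2.0, 0.5])
q = 1.0 / (1.0 + np.exp(-X @ u_true))
y = rng.binomial(1, q)

def neg_log_posterior(u):
    # -log pi(u) - log p(y | u), up to additive constants
    logits = X @ u
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    log_prior = -0.5 * lam * np.dot(u, u)
    return -(log_lik + log_prior)

u_map = minimize(neg_log_posterior, x0=np.zeros(d), method="BFGS").x
print("MAP estimate:", u_map)
```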
Example 9.6 (Example 9.5 Continued) We have:
\(E[u_i |y] = \mu_i\)
\(Cov(u_i, u_j | y) = \Sigma_{ij}\)
The MAP is determined by:
\[ \begin{aligned} u^* &= \underset{u}{\operatorname{arg min}} -log(p(u | y))\\ &= \underset{u}{\operatorname{arg min}} -log(\pi(u)) - log(p(y | u))\\ &= \underset{u}{\operatorname{arg min}} \frac{\lambda}{2} ||u||^2 + \frac{1}{2\sigma^2} \sum_{i=1}^p (y_i - <u, x_i>)^2, \end{aligned} \]
which is just ridge regression.
Remark. \[ u^* = \mu \]
This can be seen by direct optimization or by noticing that the mode of a Gaussian density coincides with its mean vector.
The posterior can thus be interpreted as combining a data-fidelity term with a regularizer:
\[ p(u | y) \propto \underset{\text{Fidelity}}{p(y | u)} ~\underset{\text{Regularizer}}{\pi(u)} \]
Back to Example 9.6, when \(u | y\) is \(N(\mu, \Sigma)\), we have formulas for \(E[u_i | y]\) and \(Cov(u_i, u_j | y)\). However, something like \(E[u_1cos(u_2) exp(u_3) | y]\) cannot be computed in closed form. Nevertheless, we can use Monte Carlo to approximate it. The idea is that we know what the posterior is, so we can sample \(u^1, \cdots, u^K \sim N_d(\mu, \Sigma)\) and consider the empirical average \(\frac{1}{K} \sum_{j=1}^K u_1^j cos(u_2^j) exp(u_3^j)\). That is, if we can sample from the posterior, we can numerically approximate \(E[G(u) | y] \approx \frac{1}{K} \sum_{j=1}^K G(u^j)\), where \(u^1, \cdots, u^K \underset{i.i.d}{\sim} Posterior\).
If it is difficult to sample from the posterior directly, we can use MCMC (Markov chain Monte Carlo).
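A minimal sketch of this Monte Carlo approximation for a Gaussian posterior; the values of \(\mu\) and \(\Sigma\) below are placeholders standing in for the posterior of Example 9.6.

```python
import numpy as np

# A minimal sketch of the Monte Carlo approximation of E[G(u) | y] when the
# posterior is N(mu, Sigma); mu and Sigma below are illustrative placeholders.
rng = np.random.default_rng(3)
mu = np.array([0.5, -1.0, 0.2])
Sigma = np.array([[0.30, 0.05, 0.00],
                  [0.05, 0.20, 0.02],
                  [0.00, 0.02, 0.10]])

def G(u):
    # the nonlinear functional u_1 * cos(u_2) * exp(u_3) from the text
    return u[:, 0] * np.cos(u[:, 1]) * np.exp(u[:, 2])

K = 100_000
samples = rng.multivariate_normal(mu, Sigma, size=K)   # u^1, ..., u^K ~ posterior
estimate = G(samples).mean()
print("E[u_1 cos(u_2) exp(u_3) | y] ≈", estimate)
```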
9.2 Gaussians in \(\mathcal R^d\)
Definition 9.1 A Gaussian random vector \(Z\) with mean zero and covariance \(\Sigma \in \mathcal R_{d\times d}\) (positive definite matrix) is a random vector with density (in \(\mathcal R^d\)):
\[ p(z) = \frac{1}{\sqrt{(2 \pi)^d ~det(\Sigma)}} \cdot exp(-\frac{1}{2} < \Sigma^{-1} z, z>), \]
and we write \(Z \sim N(0, \Sigma)\).
We can characterize this distribution via its characteristic function: \(\phi_Z(v):= E[e^{\textbf{i} <Z, v>}] = e^{-\frac{1}{2} <\Sigma v, v>}\), \(v \in \mathcal R^d\).
Notice that we can see \(Z\) as a collection of real valued random variables: \(Z = (Z_1, \cdots, Z_d)\), namely the coordinates of \(Z\). For convenience let’s write \(Z\) in the following form: \(\{Z_x\}_{x \in \mathcal X}\), where \(\mathcal X\) is the set \(\mathcal X = \{1, \cdots, d\}\).
Proposition 9.1 Let \(\mathcal X = \{1, \cdots, d\}\) and let \(Z = \{Z_x\}_{x \in \mathcal X}\) be a collection of real valued random variables. Then \(Z\) is a Gaussian vector iff for every (finite) subset \(\mathcal X' \subseteq \mathcal X\), the collection \(\{Z_x\}_{x \in \mathcal X'}\) is Gaussian.
Proof.
“\(\Leftarrow\)”: It is obvious since we can pick \(\mathcal X' = \mathcal X\).
“\(\Rightarrow\)”: For simplicity, suppose \(\mathcal X' = \{1, \cdots, m\}\), \(m \leq d\). Let \(\tilde Z = \{Z_x\}_{x \in \mathcal X'}\), \(w \in \mathcal R^m\). Then we have:
\[ \begin{aligned} E[e^{\textbf{i} <\tilde Z, w>_{\mathcal R^m}}] &= E[e^{\textbf{i} <Z, u>_{\mathcal R^d}}] \quad (u = (w_1, \cdots, w_m, 0, \cdots, 0))\\ &= e^{-\frac{1}{2} < \Sigma u, u >_{\mathcal R^d}}\\ &= e^{-\frac{1}{2} < \tilde \Sigma w, w >_{\mathcal R^m}}, \end{aligned} \]
where \(\tilde \Sigma \in \mathcal R^{m \times m}\) and \(\tilde \Sigma_{ij} = \Sigma_{ij}\), \(i,j = 1,\cdots,m\).
Since the characteristic function of \(\tilde Z\) is that of a Gaussian in \(\mathcal R^m\), we have \(\tilde Z \sim N_m(0, \tilde \Sigma)\).
Proposition 9.2 Let \(Z\) be a Gaussian random vector in \(\mathcal R^d: Z \sim N_d(0, \Sigma)\).
1. If \(A \in \mathcal R^{m \times d}\) is a matrix, then \(U = AZ\) is a Gaussian in \(\mathcal R^m\) and \(U \sim N_m(0, A\Sigma A^T)\).
2. If \(Z \sim N_d(0, \Sigma)\), \(\tilde Z \sim N_m(0, \tilde \Sigma)\), and \(Z, \tilde Z\) are independent, then \((Z, \tilde Z) \sim N_{d+m}\Big(0, \left(\begin{matrix}\Sigma&0\\0&\tilde \Sigma\end{matrix}\right)\Big)\).
3. \(Z\) is a Gaussian in \(\mathcal R^d\) iff every linear combination of its coordinates is a univariate Gaussian:
\[ Z = (Z_1, \cdots, Z_d) \sim N_d \Leftrightarrow \forall a_1, \cdots, a_d \in \mathcal R, ~\sum_{j=1}^d a_j Z_j \sim N_1. \]
4. Suppose that \(Z \sim N_d(0, \Sigma)\), and let \(v, \tilde v \in \mathcal R^d\) be two vectors. Then
\[ Cov(<Z, v>, <Z, \tilde v>) = <\Sigma v, \tilde v> = <\Sigma \tilde v, v> \]
5. Suppose \(Z = (Z_1, \cdots, Z_d)\) is Gaussian with covariance matrix \(\Sigma\). Then the coordinates of \(Z\) are independent random variables iff \(\Sigma\) is a diagonal matrix.
Proof.
For 1, let \(w \in \mathcal R^m\). Then
\[ \begin{aligned} E[e^{\textbf{i}<w,U>_{\mathcal R^m}}] &= E[e^{\textbf{i}<w,AZ>_{\mathcal R^m}}]\\ &\underset{<w,Az>_{\mathcal R^m} = <A^Tw,z>_{\mathcal R^d}}{=} E[e^{\textbf{i}<A^Tw,Z>_{\mathcal R^d}}]\\ &= e^{-\frac{1}{2} < \Sigma A^Tw, A^Tw>_{\mathcal R^d}}\\ &= e^{-\frac{1}{2} < A \Sigma A^Tw, w>_{\mathcal R^m}}, \end{aligned} \]
which is the characteristic function of \(N_m(0, A\Sigma A^T)\).
For 3, “\(\Rightarrow\)”: Let \(U = \sum_{j=1}^d a_j Z_j\) be a linear combination of the coordinates of \(Z\). By 1, \(U = AZ\) with \(A = (a_1, \cdots, a_d) \in \mathcal R^{1 \times d}\), so \(U\) is a univariate Gaussian.
“\(\Leftarrow\)”: Let \(a \in \mathcal R^d\), with \(a = (a_1, \cdots, a_d)\). Then, \(E[e^{\textbf{i}<a,Z>}] = E[e^{\textbf{i} \sum_{j=1}^d a_j Z_j}] \underset{\sum_{j=1}^d a_j Z_j \sim N(0, \sigma_a^2)}{=} e^{-\frac{1}{2} \sigma_a^2}\), where \(\sigma_a^2 = Var(\sum_{j=1}^d a_j Z_j) = <\Sigma a, a>_{\mathcal R^d}\), \(\Sigma\) is the covariance matrix of \(Z = (Z_1, \cdots, Z_d)\). Thus, \(E[e^{\textbf{i}<a,Z>}] = e^{-\frac{1}{2} <\Sigma a, a>_{\mathcal R^d}}\), \(\forall a \in \mathcal R^d\), which means that \(Z \sim N_d\).
For 4:
\[ \begin{aligned} Cov(<Z, v>, <Z, \tilde v>) &= Cov\Big(\sum_{i=1}^d Z_i v_i, \sum_{j=1}^d Z_j \tilde v_j\Big)\\ &= \sum_{i=1}^d \sum_{j=1}^d v_i \tilde v_j Cov(Z_i, Z_j)\\ &= \sum_{i=1}^d \sum_{j=1}^d \Sigma_{ij}v_i \tilde v_j\\ &= <\Sigma v, \tilde v> = <\Sigma \tilde v, v>. \end{aligned} \]
9.3 Diagonalization of \(\Sigma\) and Sampling from \(N(0, \Sigma)\)
Theorem 9.1 If \(\Sigma \in \mathcal R^{d \times d}\) is symmetric positive semi-definite, then there are vectors \(v_1, \cdots, v_d \in \mathcal R^d\) and non-negative numbers \(\lambda_1 \geq \cdots \geq \lambda_d \geq 0\) s.t. \(\Sigma v_i = \lambda_i v_i\) and \(<v_i, v_j>_{\mathcal R^d} = \begin{cases}1 & i=j\\0 & i \neq j\end{cases}\).
In short: \(\Sigma\) induces an orthonormal basis for \(\mathcal R^d\) formed of eigenvectors of \(\Sigma\). Moreover, all of \(\Sigma\)’s eigenvalues are non-negative.
- How to sample from \(Z \sim N_d(0, \Sigma)\):
Since \(Z \in \mathcal R^d\), we notice that \(Z = \sum_{i=1}^d <Z, v_i>_{\mathcal R^d} v_i\), where \(\{v_i\}\) is a fixed orthonormal basis of eigenvectors of \(\Sigma\). We then have \(<Z, v_i> \sim N_1(0, <\Sigma v_i, v_i>) = N_1(0, \lambda_i <v_i, v_i>) = N_1(0, \lambda_i)\).
Moreover, for \(i \neq j\):
\[ \begin{aligned} Cov(<Z, v_i>, <Z, v_j>) &= <\Sigma v_i, v_j>\\ &= <\lambda_i v_i, v_j>\\ &= \lambda_i <v_i, v_j>\\ &= 0, \end{aligned} \]
which means that the random variables \(<Z, v_1>,\cdots, <Z,v_d>\) are uncorrelated, and hence independent (they are jointly Gaussian).
So, assuming \(\lambda_i > 0\) for all \(i\), we can write \(Z = \sum_{i=1}^d \sqrt{\lambda_i}\big(\frac{<Z, v_i>}{\sqrt{\lambda_i}}\big) v_i = \sum_{i=1}^d \sqrt{\lambda_i}\xi_i v_i\), where \(\xi_1, \cdots, \xi_d\) are i.i.d \(N(0,1)\) random variables.
Thus, to sample from \(N_d(0, \Sigma)\), we can do the following:
1. Find the eigenpairs \(\{(\lambda_i, v_i)\}_{i=1,\cdots,d}\) of \(\Sigma\).
2. Sample \(\xi_1, \cdots, \xi_d \underset{i.i.d}{\sim} N(0,1)\).
3. Set \(Z = \sum_{i=1}^d \sqrt{\lambda_i}\xi_i v_i\).
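A minimal sketch of these three steps in code; the covariance \(\Sigma\) below is an arbitrary illustrative choice.

```python
import numpy as np

# A minimal sketch of the three steps above: sample Z ~ N_d(0, Sigma) from the
# eigendecomposition of Sigma. Sigma below is an arbitrary illustrative covariance.
rng = np.random.default_rng(4)
d = 4
A = rng.normal(size=(d, d))
Sigma = A @ A.T                                   # symmetric positive semi-definite

# Step 1: eigenpairs (eigh returns eigenvalues in ascending order).
lam, V = np.linalg.eigh(Sigma)                    # columns of V are the v_i
lam = np.clip(lam, 0.0, None)                     # guard against tiny negative values

# Steps 2 and 3: Z = sum_i sqrt(lambda_i) * xi_i * v_i, repeated for many samples.
K = 50_000
xi = rng.normal(size=(K, d))                      # i.i.d. N(0,1)
Z = (xi * np.sqrt(lam)) @ V.T                     # each row is one sample of Z

# Sanity check: the empirical covariance should be close to Sigma.
print(np.round(np.cov(Z, rowvar=False) - Sigma, 2))
```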
Remark. The above is precisely what we need to sample from Gaussians in more general settings. In particular, for Gaussians in certain RKHSs.
9.4 Stochastic Processes and Gaussian Processes
Definition 9.2 (Stochastic Processes) Let \(\mathcal X\) be a set. A stochastic process over \(\mathcal X\) is a collection of random variables \(\{Z_x\}_{x \in \mathcal X}\).
Definition 9.3 (Gaussian Processes) A Gaussian Process over \(\mathcal X\) is a stochastic process \(\{Z_x\}_{x \in \mathcal X}\) with the property that \(\forall\) finite subset \(\mathcal X'\) of \(\mathcal X\), the vector \(\{Z_x\}_{x \in \mathcal X'}\) is a Gaussian.
Definition 9.4 (Covariance Function) Suppose \(Z = \{Z_x\}_{x \in \mathcal X}\) is a Gaussian Process over \(\mathcal X\). We define \(Z\)’s covariance function as:
\[ K: \mathcal X \times \mathcal X \to \mathcal R\\ K(x, \tilde x):= Cov(Z_x, Z_{\tilde x}) \]
We will write \(Z \sim N(0, K)\) for simplicity.
Proposition 9.3 The covariance function of a Gaussian process over \(\mathcal X\) is a kernel over \(\mathcal X\):
Indeed, for any \(x_1, \cdots, x_n \in \mathcal X\), the vector \((Z_{x_1}, \cdots, Z_{x_n})\) is an \(n\)-dimensional Gaussian, so its covariance matrix \(\Sigma\), with \(\Sigma_{ij} = K(x_i, x_j)\), is symmetric positive semi-definite. This is exactly what it means for \(K\) to be a kernel.
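In practice, this is also how one samples the finite-dimensional marginals of a Gaussian process: form the kernel matrix at the chosen points and draw from the corresponding multivariate normal. A minimal sketch, with a squared-exponential kernel as an illustrative choice (not one used elsewhere in this chapter):

```python
import numpy as np

# A minimal sketch of sampling the finite-dimensional marginals of Z ~ N(0, K):
# build Sigma_ij = K(x_i, x_j) on a grid and draw from N(0, Sigma).
rng = np.random.default_rng(5)

def K(x, x_tilde, ell=0.2):
    # squared-exponential kernel; an illustrative choice
    return np.exp(-(x - x_tilde) ** 2 / (2 * ell ** 2))

xs = np.linspace(0.0, 1.0, 200)                        # points x_1, ..., x_n
Sigma = K(xs[:, None], xs[None, :])                    # Sigma_ij = K(x_i, x_j)
Sigma += 1e-10 * np.eye(len(xs))                       # jitter for numerical stability

Z = rng.multivariate_normal(np.zeros(len(xs)), Sigma)  # one sample path on xs
print(Z[:5])
```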
Let \(K: \mathcal R^d \times \mathcal R^d \to \mathcal R\) be a continuous kernel, for which \(|K(x, \tilde x)| \leq M\), where \(M\) is a constant. Let \(p: \mathcal R^d \to \mathcal R\) be a density function.
Consider the following operator:
\[ T_K: f \in \mathcal L^2(p) \to \int_{\mathcal R^d} K(\cdot, x) f(x) p(x) dx, \]
where \(\mathcal L^2(p) = \{f: \int_{\mathcal R^d} f^2(x) p(x)dx < \infty\}\). In particular, for \(f \in \mathcal L^2(p)\), \(T_K f\) is a function on \(\mathcal R^d\), defined by \(T_K f(\tilde x):= \int_{\mathcal R^d} K(\tilde x, x) f(x) p(x) dx\), \(\tilde x \in \mathcal R^d\).
Definition 9.5 (Eigenvalues and Eigenfunctions) \(\lambda \in \mathcal R\) is said to be an eigenvalue of \(T_K\) if there exists a nonzero \(f \in \mathcal L^2(p)\) s.t. \(T_Kf = \lambda f\).
In other words, if \(\lambda f(\cdot) = \int_{\mathcal R^d} K(\cdot, \tilde x) f(\tilde x) p(\tilde x) d \tilde x\), we call \((\lambda, f)\) an eigenpair.
Theorem 9.2 (Mercer's Theorem) There exists a sequence of eigenpairs \(\{(\lambda_k, \phi_k)\}_{k \in \mathcal N}\) of \(T_K\) s.t.:
1. \(<\phi_k, \phi_l>_{\mathcal L^2(p)} = \int_{\mathcal R^d} \phi_k(x) \phi_l(x) p(x) dx = \begin{cases}1 &k=l\\0&o.w.\end{cases}\)
2. \(\mathcal L^2(p) = \{\sum_{i=1}^\infty a_i \phi_i: \sum_{i=1}^\infty a_i^2 < \infty\}\); every \(f \in \mathcal L^2(p)\) can be represented as \(\sum_{i=1}^\infty <f, \phi_i>_{\mathcal L^2(p)} \phi_i\), and the inner product can be written as \(<f,g>_{\mathcal L^2(p)} = \sum_{i=1}^\infty <f, \phi_i>_{\mathcal L^2(p)} < g, \phi_i>_{\mathcal L^2(p)}\).
3. The numbers \(\lambda_1, \lambda_2, \cdots\) satisfy \(\lambda_1 \geq \lambda_2 \geq \cdots \geq 0\) and \(\lambda_n \to 0\) as \(n \to \infty\).
4. The kernel \(K\) can be written as \(K(x, \tilde x) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(\tilde x)\).
Proof.
- We verify item 4. First, note that \(K(\cdot, x) \in \mathcal L^2(p)\):
\[ \begin{aligned} \int_{\mathcal R^d} (K(\tilde x, x))^2 p(\tilde x) d\tilde x &\leq \int_{\mathcal R^d} [\sqrt{K(\tilde x, \tilde x)} \sqrt{K(x,x)}]^2 p(\tilde x) d\tilde x\\ &= K(x,x) \int_{\mathcal R^d} K(\tilde x, \tilde x) p(\tilde x) d\tilde x\\ &< \infty \end{aligned} \]
In particular, \(K(\cdot, x) = \sum_{i=1}^\infty <K(\cdot,x), \phi_i>_{\mathcal L^2(p)} \phi_i(\cdot)\). Thus, \(K(\tilde x, x) = \sum_{i=1}^\infty <K(\cdot,x), \phi_i>_{\mathcal L^2(p)} \phi_i(\tilde x)\). We just need to show that \(<K(\cdot,x), \phi_i>_{\mathcal L^2(p)} = \lambda_i \phi_i(x)\):
\[ \begin{aligned} <K(\cdot,x), \phi_i>_{\mathcal L^2(p)} &= \int_{\mathcal R^d} K(\tilde x, x) \phi_i(\tilde x) p(\tilde x) d \tilde x\\ &= T_K \phi_i(x)\\ &= \lambda_i \phi_i(x) \end{aligned} \]
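When the eigenpairs of \(T_K\) are not available in closed form, they can be approximated numerically. The following is a minimal Nyström-type sketch (not part of the lecture): the integral against \(p\) is replaced by an empirical average over samples from \(p\); the kernel and the density are illustrative choices.

```python
import numpy as np

# A minimal Nystrom-type sketch (not from the lecture): approximate eigenpairs of
# T_K by replacing the integral against p with an empirical average over samples
# x_1, ..., x_n ~ p. The kernel and the density p = N(0,1) are illustrative.
rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)                            # x_1, ..., x_n ~ p

def K(s, t, ell=0.5):
    return np.exp(-(s - t) ** 2 / (2 * ell ** 2))

Kn = K(x[:, None], x[None, :]) / n                # discretization of T_K
lam_hat, U = np.linalg.eigh(Kn)
lam_hat, U = lam_hat[::-1], U[:, ::-1]            # sort eigenvalues descending

phi_hat = np.sqrt(n) * U                          # phi_k(x_j) ~ sqrt(n) * U_{jk}

print("leading eigenvalues:", np.round(lam_hat[:5], 4))         # decreasing, -> 0
print("empirical <phi_1, phi_2>_{L2(p)}:",
      np.round(phi_hat[:, 0] @ phi_hat[:, 1] / n, 4))           # ~ 0 (orthogonal)
```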
- How to define a Gaussian Process over \(\mathcal X\) with covariance function \(K\)? How to do sampling?
Assuming \(K\) satisfies the conditions of Mercer’s Theorem for some density \(p\), we consider the eigenpairs \(\{(\lambda_i, \phi_i)\}\) and do the following:
Let \(\xi_1, \xi_2, \cdots\) be i.i.d \(N(0,1)\) random variables. We define a random function: \(Z = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i\). In particular, \(Z(x) = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i(x)\).
It follows that,
\[ \begin{aligned} Cov(Z(x), Z(\tilde x)) &= E\Big[\sum_{i=1}^\infty \sum_{j=1}^\infty \sqrt{\lambda_i} \sqrt{\lambda_j} \xi_i \xi_j \phi_i(x)\phi_j(\tilde x)\Big]\\ &= \sum_{i=1}^\infty \sum_{j=1}^\infty \sqrt{\lambda_i} \sqrt{\lambda_j} \phi_i(x)\phi_j(\tilde x)E[\xi_i \xi_j]\\ &\underset{E[\xi_i \xi_j] = 1_{i=j}}{=} \sum_{i=1}^\infty\lambda_i \phi_i(x) \phi_i(\tilde x)\\ &= K(x, \tilde x) \end{aligned} \]
Theorem 9.3 (Karhunen-Loève Expansion) The representation \(Z = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i\) for \(Z \sim N(0, K)\) is known as the Karhunen-Loève expansion.
Example 9.7 Let \(\mathcal X = [0,1]\) and let \(K(s,t) = min\{s,t\} - st\), which is a kernel. Consider \(p(t) = 1\), \(t \in [0,1]\), and the operator \(T_K: f \in \mathcal L^2(p) \to T_Kf \in \mathcal L^2(p)\), where \(T_K f(\cdot) = \int_0^1 K(\cdot, s) f(s) ds\).
Let us try to find the eigenpairs of \(T_K\). That is,
\[ \begin{aligned} \lambda \phi(t) &= \int_0^1 K(t,s) \phi(s) ds\\ &= \int_0^1 (min\{s,t\} - st) \phi(s) ds\\ &= \int_0^1 min\{s,t\} \phi(s) ds - t\int_0^1 s \phi(s) ds\\ &= \int_0^t s\phi(s) ds + t\int_t^1 \phi(s) ds - t\int_0^1 s \phi(s) ds\\ &= \int_0^t s\phi(s) ds + t\int_t^1 \phi(s) ds - tC, \qquad C := \int_0^1 s \phi(s) ds. \end{aligned} \]
Take derivative w.r.t. \(t\) to get
\[ \begin{aligned} \lambda \phi'(t) &= t\phi(t) + [\int_t^1 \phi(s) ds - t\phi(t)] - C\\ & = \int_t^1 \phi(s) ds - C \end{aligned} \]
Take derivative w.r.t. \(t\) again to derive:
\[ \lambda \phi''(t) = - \phi(t) \]
Evaluating the integral equation at \(t = 0\) and \(t = 1\) gives \(\lambda\phi(0) = 0\) and \(\lambda\phi(1) = 0\) (this is also needed for \(\phi \in \mathcal H\), for reasons that will soon become apparent), so we must have:
\[ \phi(0) = 0 = \phi(1) \]
Thus, we have the equation:
\[ \begin{cases} \lambda \phi''(t) + \phi(t) = 0, ~\forall t \in (0,1)\\ \phi(0) = 0\\ \phi(1) = 0 \end{cases} \]
Thus, for \(\lambda > 0\), \(\phi\) has to take the form:
\[\phi(t) = A cos(\frac{t}{\sqrt{\lambda}}) + B sin(\frac{t}{\sqrt \lambda})\]
However, we need to make sure \(\phi(0) = 0 = \phi(1)\), which means that \(0 = \phi(0) = A\), and \(0 = \phi(1) = B sin(\frac{1}{\sqrt \lambda})\), which forces \(\frac{1}{\sqrt \lambda}\) to be a multiple of \(\pi\).
In other words: \(\frac{1}{\sqrt \lambda} = k\pi\), i.e. \(\lambda = \frac{1}{k^2 \pi^2}\), \(k \in \mathcal N\).
Thus, we have
\[ \begin{cases} \lambda_k = \frac{1}{k^2 \pi^2}\\ \phi_k(t) = C_k sin (k\pi t) \end{cases} \]
where \(C_k\) satisfies \(\int_0^1 \phi_k^2(t) dt = 1\), i.e. \(C_k = \sqrt 2\).
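A minimal sketch of the Karhunen-Loève expansion for this example: truncate \(Z = \sum_k \sqrt{\lambda_k}\xi_k\phi_k\) at \(m\) terms and check the covariance empirically. The truncation level, grid, and number of paths below are illustrative choices.

```python
import numpy as np

# A minimal sketch of the Karhunen-Loeve expansion from Example 9.7: sample the
# Gaussian process with covariance K(s,t) = min(s,t) - st by truncating
# Z = sum_k sqrt(lambda_k) xi_k phi_k at m terms.
rng = np.random.default_rng(7)
m = 200                                               # truncation level (illustrative)
t = np.linspace(0.0, 1.0, 101)

k = np.arange(1, m + 1)
lam = 1.0 / (k ** 2 * np.pi ** 2)                     # lambda_k = 1/(k^2 pi^2)
phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(t, k))   # phi_k(t) = sqrt(2) sin(k pi t)

n_paths = 20_000
xi = rng.normal(size=(n_paths, m))                    # i.i.d. N(0,1)
Z = (xi * np.sqrt(lam)) @ phi.T                       # each row is one path Z(t)

# Sanity check: empirical Cov(Z(s), Z(t)) should be close to min(s,t) - s*t.
emp_cov = np.cov(Z, rowvar=False)
s_idx, t_idx = 25, 75                                 # s = 0.25, t = 0.75
print(emp_cov[s_idx, t_idx], min(t[s_idx], t[t_idx]) - t[s_idx] * t[t_idx])
```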
Remark. Let’s go back to the setting of Mercer’s Theorem:
\(K\) a continuous, bounded kernel.
\(p\) a density.
\(\{(\lambda_k, \phi_k)\}_{k \in \mathcal N}\) are eigenpairs of \(T_K\).
\(\mathcal L^2(p)\) can be alternatively written as: \(\mathcal L^2(p) = \{\sum_{i=1}^\infty a_i \phi_i : \sum_{i=1}^\infty a_i^2 < \infty\}\).
The RKHS \(\mathcal H\) of \(K\) is defined as
\[ \mathcal H := \{\sum_{i=1}^\infty b_i K(\cdot, x_i): \{x_i\} \subseteq \mathcal X, \{b_i\} \subseteq \mathcal R, \text{ s.t } \sum_{j=1}^\infty \sum_{i=1}^\infty K(x_i, x_j) b_i b_j < \infty\} \]
As
\[ \begin{aligned} \sum_{i=1}^\infty b_i K(\cdot, x_i) &= \sum_{i=1}^\infty b_i \sum_{k=1}^\infty \lambda_k \phi_k(\cdot) \phi_k(x_i)\\ &= \sum_{k=1}^\infty \lambda_k (\sum_{i=1}^\infty b_i\phi_k(x_i)) \phi_k(\cdot)\\ &= \sum_{k=1}^\infty <\sum_{i=1}^\infty b_i K(\cdot, x_i), \phi_k>_{\mathcal L^2(p)} \phi_k(\cdot)\\ &:= \sum_{k=1}^\infty a_k \phi_k(\cdot), \end{aligned} \]
we can write it as:
\[\mathcal H = \{\sum_{k=1}^\infty a_k \phi_k(\cdot): \sum_{k=1}^\infty \frac{a_k^2}{\lambda_k} < \infty\}.\]
In particular,
\(\mathcal H \subseteq \mathcal L^2(p)\)
\(\phi_i \in \mathcal H\), \(\forall i\).
\(Z = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i \notin \mathcal H\) (almost surely), although it does belong to \(\mathcal L^2(p)\).
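A short heuristic for the last point (a sketch, freely interchanging expectation and summation, and using the norm characterization \(\|\sum_k a_k \phi_k\|^2_{\mathcal H} = \sum_k a_k^2/\lambda_k\) suggested by the description of \(\mathcal H\) above):
\[ E\|Z\|^2_{\mathcal L^2(p)} = E\Big[\sum_{i=1}^\infty \lambda_i \xi_i^2\Big] = \sum_{i=1}^\infty \lambda_i = \int_{\mathcal R^d} K(x,x) p(x) dx < \infty, \qquad\text{while}\qquad \|Z\|^2_{\mathcal H} = \sum_{i=1}^\infty \frac{(\sqrt{\lambda_i}\xi_i)^2}{\lambda_i} = \sum_{i=1}^\infty \xi_i^2 = \infty \ \text{a.s.} \]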