Gaussian processes (GP) provide a principled and tractable approach to solving supervised and unsupervised machine learning problems. Broadly, machine learning problems can be classified as supervised (learning a mapping from inputs to outputs using labelled examples) or unsupervised (discovering structure in unlabelled data).
At the core of GP is the multivariate Gaussian distribution. GP uses the fact that any $n$ observations, $(x_1, \ldots, x_n)$, in a data set can be viewed as a single point sampled from an $n$-variate Gaussian distribution. It should be emphasized that GP is a non-parametric approach to the problem of automated inference: here, we fit the data with a distribution over all admissible functions! This is in contrast to parametric methods, where the parameters of a given function are estimated from the data.
GP defines a prior over functions (a multivariate Gaussian distribution) using a suitable kernel function. The training set is then used to condition the prior and obtain the posterior distribution. This systematic procedure makes GP a tractable framework for various machine learning problems. The predictions in GP are probabilistic, and thus quantities like the mean, standard deviation, and confidence intervals follow naturally. We explore these in more detail below.
A multivariate Gaussian distribution is completely specified by its mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$.
For an $n$-dimensional vector $\boldsymbol X$, the distribution is written as
$$ \boldsymbol{X}\sim N\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right). $$The covariance matrix $\boldsymbol\Sigma$ is an $n\times n$ matrix for an $n$-variate Gaussian distribution, while the mean is an $n$-dimensional vector. Their formal expressions are
$$ {\boldsymbol {\mu }}=\operatorname {E} [\mathbf {X} ],\qquad {\boldsymbol {\Sigma }}=\operatorname {E} [(\mathbf {X} -{\boldsymbol {\mu }})(\mathbf {X} -{\boldsymbol {\mu }})^{\rm {T}}]=[\operatorname {Cov} [X_{i},X_{j}];\;1\leq i,j\leq n]. $$The explicit form of the multivariate Gaussian distribution is
$$ f(\boldsymbol X\mid \boldsymbol\mu ,\boldsymbol\Sigma)={\frac {1}{\sqrt {(2\pi)^{n} |\boldsymbol\Sigma|}}}\exp\Big\{-{\frac{1}{2}}(\boldsymbol X-\boldsymbol \mu )^T\,\boldsymbol\Sigma^{-1}\,(\boldsymbol X-\boldsymbol \mu )\Big\}. $$The code below plots the surface of a bivariate Gaussian.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib import cm
f = plt.figure(num=None, figsize=(15, 10)); ax = f.add_subplot(1, 1, 1, projection='3d');
x, y = np.meshgrid(np.linspace(-4, 4, 64), np.linspace(-4, 4, 64));   # grid over the (x, y) plane
r = np.sqrt(x**2 + y**2); z = .5*(np.exp(-r*r/4));                    # isotropic (unnormalized) Gaussian
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap=cm.RdBu, linewidth=0);
plt.axis('off'); plt.title('Bivariate Gaussian', fontsize=20);
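To connect the plot with the definitions of $\boldsymbol\mu$ and $\boldsymbol\Sigma$ above, here is a small sketch (not part of the original analysis) that draws samples from a bivariate Gaussian with np.random.multivariate_normal and checks that the empirical mean and covariance recover the parameters; the values of mu0 and Sigma0 are arbitrary.
# draw many samples from a bivariate Gaussian and recover its parameters empirically
mu0 = np.array([1.0, -1.0])                      # chosen mean
Sigma0 = np.array([[1.0, 0.6], [0.6, 2.0]])      # chosen (symmetric, positive-definite) covariance
samples = np.random.multivariate_normal(mu0, Sigma0, size=100000)
print('empirical mean:\n', samples.mean(axis=0))     # close to mu0
print('empirical covariance:\n', np.cov(samples.T))  # close to Sigma0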
Thus, we have a joint probability distribution over many variables. Using Bayes' theorem, it is possible to derive the posterior for a set of test points from a prior constructed out of the training points. Let's assume that we have a training set $(x, y)$ and we want to predict the outputs $y_*$ at test inputs $x_*$. The joint distribution then looks like
$$ \begin{pmatrix}y \\ y_* \end{pmatrix} \sim N\Bigg( \begin{pmatrix}\mu \\ \mu_* \end{pmatrix}, \begin{pmatrix}K & K_*\\ K_*^T & K_{**} \end{pmatrix} \Bigg). $$Here $K$ is the kernel matrix, given as
$$ K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \ldots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \ldots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \ldots & k(x_n, x_n) \end{pmatrix}, $$where we choose $k(x, x')$ to be
$$ k(x, x') = \alpha_1\exp\left(-\tfrac{\alpha_2}{2}(x-x')^2\right). $$This squared-exponential choice of kernel function is popular as it has desirable properties: it is stationary (it depends only on $x - x'$), it assigns high covariance to nearby points, and the sample functions it produces are smooth. The hyperparameter $\alpha_1$ sets the overall variance, while $\alpha_2$ is an inverse squared length scale (it appears as iP in the code below, with $\alpha_1 = 1$).
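As a quick illustration (a sketch, not part of the analysis that follows), the snippet below builds the kernel matrix $K$ for a handful of arbitrary points with $\alpha_1 = 1$ and $\alpha_2 = 10$, and verifies that it is symmetric with ones on the diagonal and positive semi-definite.
# kernel matrix for a few arbitrary points (alpha_1 = 1, alpha_2 = 10)
alpha2 = 10.0
xDemo = np.array([0.0, 0.2, 0.3, 0.9])
KDemo = np.exp(-0.5*alpha2*(xDemo[:, None] - xDemo[None, :])**2)
print(KDemo.round(3))                                  # symmetric, ones on the diagonal
print(np.all(np.linalg.eigvalsh(KDemo) > -1e-12))      # all eigenvalues non-negative (PSD)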
So we have a prior distribution over the training set, and we seek the posterior $y_* \mid y$. It is given as
$$ y_* \mid y \sim N(\mu_*, \Sigma_*). $$The mean of this posterior is the best estimate of $y_*$. Assuming a zero prior mean, it comes out to be
$$ \mu_* = K^T_*K^{-1}y, $$while the covariance gives the uncertainty,
$$ \Sigma_* = K_{**}-K^T_* K^{-1}K_*. $$Explicitly, samples from the posterior are then drawn as
$$ y_* = \mu_* + L\, N(0, I). $$Here $L$ is the Cholesky factor of $\Sigma_*$, such that $LL^T = \Sigma_*$. The above analysis is implemented in the class below.
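As a bridge between these formulas and the class, here is a minimal plain-NumPy sketch of the same computation, assuming the squared-exponential kernel above with $\alpha_1 = 1$, $\alpha_2 = 10$; the toy data and the names (kernel, xTrain, mu_star, and so on) are illustrative and do not appear in the code that follows.
# minimal GP regression sketch: posterior mean, covariance, and one sample
def kernel(a, b, alpha2=10.0):
    '''squared-exponential kernel with alpha_1 = 1'''
    return np.exp(-0.5*alpha2*(a[:, None] - b[None, :])**2)
xTrain = np.array([0.0, 0.2, 0.3, 0.9]); yTrain = np.cos(xTrain)   # toy training data
xTest = np.linspace(0, 0.5*np.pi, 64)                              # test inputs
K, Ks, Kss = kernel(xTrain, xTrain), kernel(xTrain, xTest), kernel(xTest, xTest)
mu_star = np.dot(Ks.T, np.linalg.solve(K, yTrain))                 # posterior mean
Sigma_star = Kss - np.dot(Ks.T, np.linalg.solve(K, Ks))            # posterior covariance
L = np.linalg.cholesky(Sigma_star + 1e-6*np.eye(xTest.size))       # small jitter for stability
y_star = mu_star + np.dot(L, np.random.normal(size=xTest.size))    # one posterior sample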
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
class GPR:
    def __init__(self, nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu, sd):
        self.nS = nS    # number of test points
        self.nT = nT    # number of training points
        self.iP = iP    # inverse squared length scale (alpha_2 in the kernel)
        self.nP = nP    # number of prior/posterior sample paths
        self.Lx = Lx    # size of the domain
        self.xS = xS    # test inputs
        self.xT = xT    # training inputs
        self.yS = yS    # posterior samples at the test inputs
        self.yT = yT    # training outputs
        self.yP = yP    # prior samples at the test inputs
        self.K = K      # kernel matrix K (training-training)
        self.Ks = Ks    # kernel matrix K_* (training-test)
        self.Kss = Kss  # kernel matrix K_** (test-test)
        self.mu = mu    # posterior mean
        self.sd = sd    # posterior standard deviation
def calcDistM(self, r, s):
'''Calculate distance matrix between 2 1D arrays'''
return r[..., np.newaxis] - s[np.newaxis, ...]
    def calcKernels(self):
        '''Calculate the kernel matrices K (train-train), K_* (train-test), K_** (test-test)'''
        cc = self.iP*0.5
        self.K   = np.exp(-cc*self.calcDistM(self.xT, self.xT)**2)
        self.Ks  = np.exp(-cc*self.calcDistM(self.xT, self.xS)**2)
        self.Kss = np.exp(-cc*self.calcDistM(self.xS, self.xS)**2)
        return
    def calcPrior(self):
        '''Draw samples from the prior at the test inputs'''
        L = np.linalg.cholesky(self.Kss + 1e-6*np.eye(self.nS))   # jitter for numerical stability
        G = np.random.normal(size=(self.nS, self.nP))
        self.yP = np.dot(L, G)
        return
    def calcMuSigma(self):
        '''Calculate the posterior mean, standard deviation, and samples'''
        self.mu = np.dot(self.Ks.T, np.linalg.solve(self.K, self.yT))
        vv = self.Kss - np.dot(self.Ks.T, np.linalg.solve(self.K, self.Ks))
        self.sd = np.sqrt(np.abs(np.diag(vv)))
        # draw posterior samples using the Cholesky factor of the posterior covariance
        L = np.linalg.cholesky(vv + 1e-6*np.eye(self.nS))
        self.yS = self.mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(self.nS, self.nP)))
        return
def plotResults(self):
plt.plot(self.xT, self.yT, 'o', ms=10, mfc='#348ABD', mec='none', label='training set' )
plt.plot(self.xS, self.yS, '#dddddd', lw=1.5, label='posterior')
plt.plot(self.xS, self.mu, '#A60628', lw=2, label='mean')
# fill 95% confidence interval (2*sd about the mean)
plt.fill_between(self.xS.flat, self.mu-2*self.sd, self.mu+2*self.sd, color="#348ABD", alpha=0.4, label='2 sigma')
plt.axis('tight'); plt.legend(fontsize=15); plt.rcParams.update({'font.size':18})
def runGPR(self):
self.calcKernels()
self.calcPrior()
self.calcMuSigma()
self.plotResults()
The above can now be used for different training sets to obtain posterior distributions. Two illustrative examples are discussed here.
#parameters
nS, Lx = 64, 0.5*np.pi
iP, nP = 10.0 , 1
xS = np.linspace(0, Lx, nS)
yS = xS*0; yP = xS*0
K = 0; Ks=0; Kss=0; mu=0;sd=0;
f = plt.figure(figsize=(20, 14));
## Training set #1
nT = 4
xT = np.array([0, .2, .3, .9])
yT = np.cos(xT)
sp = f.add_subplot(2,2,1);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
## Training set #2
nT = 4
xT = np.linspace(0, Lx, nT);
yT = np.cos(xT)
sp = f.add_subplot(2,2,2);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
## Training set #3
nT = 8
xT = np.linspace(0, Lx, nT);
yT = np.cos(xT)
sp = f.add_subplot(2,2,3);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
## Training set #4
nT = 16
xT = np.linspace(0, Lx, nT);
yT = np.cos(xT)
sp = f.add_subplot(2,2,4);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
We note that the prediction improves as more training points are used. It is also instructive to note that, for the same number of training points, the prediction depends on their spacing. In the first panel, the fit is good in the range where the training points are closely spaced and poor elsewhere. In the second panel, on the other hand, the points are evenly spaced and the fit is good everywhere.
Let's now have a look at data obtained from a function with more wiggles, the sinc function $\sin(x)/x$.
#parameters
nS, Lx = 256, 4.8*np.pi
iP, nP = 10.0 , 1
xS = np.linspace(-Lx, Lx, nS)
yS = xS*0; yP = xS*0
K = 0; Ks=0; Kss=0; mu=0;sd=0;
f = plt.figure(figsize=(20, 14));
## Training set #1
nT = 16
xT = np.linspace(-Lx, Lx, nT)
yT = np.sin(xT)/xT
sp = f.add_subplot(2,2,1);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
## Training set #2
nT = 32
xT = np.linspace(-Lx, Lx, nT)
yT = np.sin(xT)/xT
sp = f.add_subplot(2,2,2);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
## Training set #3
nT = 64
xT = np.linspace(-Lx, Lx, nT)
yT = np.sin(xT)/xT
sp = f.add_subplot(2,2,3);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()
## Training set #4
nT = 128
xT = np.linspace(-Lx, Lx, nT)
yT = np.sin(xT)/xT
sp = f.add_subplot(2,2,4);
plt.title('%s training points'%nT, fontsize=20)
gp = GPR(nS, nT, iP, nP, Lx, xS, xT, yS, yT, yP, K, Ks, Kss, mu,sd)
gp.runGPR()