How to turn a collection of small building blocks into a versatile tool for solving regression problems.

Even if you have spent some time reading about machine learning, chances are that you have never heard of Gaussian processes. And if you have, rehearsing the basics is always a good way to refresh your memory. With this blog post we want to give an introduction to Gaussian processes and make the mathematical intuition behind them more approachable.

Gaussian processes are a powerful tool in the machine learning toolbox for
*fitting* a function to the data.
For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian
processes offer an elegant solution to this problem by assigning a probability to each of these functions. The mean
of this probability distribution then represents the most probable characterization of the data. Furthermore,
using a probabilistic approach allows us to incorporate the confidence of the prediction into the regression result.

We will first explore the mathematical foundation they are built on, and you can follow along using interactive figures and hands-on examples. They help to explain the impact of individual components and show the flexibility of Gaussian processes. After following this article, we hope that you will have a visual intuition for how Gaussian processes work and how you can configure them for different types of data.

Before we can explore Gaussian processes, we need to understand the mathematical concepts they are based on. As the
name suggests, the Gaussian distribution (which is often also referred to as *normal* distribution) is the
basic building block of Gaussian processes. In particular, we are interested in the multivariate case of this
distribution, where each random variable is distributed normally and their joint distribution is also Gaussian. In
general, the multivariate Gaussian distribution is defined by a mean vector

The mean vector

We say

Visually, the distribution is centered around the mean and the covariance matrix defines its shape. The following figure shows the influence of these parameters on a two-dimensional Gaussian distribution. The variances of the individual random variables are on the diagonal of the covariance matrix, while the other values show the covariance between them.
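To make the role of the two parameters concrete, here is a minimal NumPy sketch. The particular mean vector and covariance matrix are values we made up for illustration. The code draws samples from a two-dimensional Gaussian and checks that the empirical statistics recover the parameters, with the variances (the squared standard deviations) on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed values, not taken from the article's figure).
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

# Draw many samples from the two-dimensional Gaussian.
samples = rng.multivariate_normal(mean, cov, size=100_000)

# The empirical statistics should approximate the parameters we started from.
empirical_mean = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)
empirical_std = samples.std(axis=0)

print("mean:", empirical_mean)
print("covariance:\n", empirical_cov)
print("squared standard deviations:", empirical_std**2)
```

Changing the off-diagonal value rotates and stretches the cloud of samples, while the diagonal entries control the spread along each axis.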

Gaussian distributions are widely used to model the real world: either as a surrogate when the original
distributions are unknown, or in the context of the *central limit theorem*

Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. This means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. In the following we will take a closer look at both of these operations, as they are the foundation for Gaussian processes.

*Marginalization* and *conditioning* both work on subsets of the original distribution and we will use the following notation:

With

Through *marginalization* we can extract partial information from multivariate probability distributions. In particular, given a normal probability distribution

The interpretation of this equation is straightforward: each partition

The way to interpret this equation is that if we are interested in the probability of

Another important operation for Gaussian processes is *conditioning*.
It is used to determine the probability of one variable depending on another variable.
Similar to marginalization, this operation is also closed and yields a modified Gaussian distribution.
This operation is the cornerstone of Gaussian processes since it allows Bayesian inference, which we will talk about in the next section.
Conditioning is defined by:
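Written out in one common notation, this is the standard textbook result for a jointly Gaussian pair. The symbols are our assumption here: $X$ and $Y$ denote the two partitions, $\mu_X$ and $\mu_Y$ their means, and $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YX}$, $\Sigma_{YY}$ the corresponding blocks of the covariance matrix.

```latex
X \mid Y \sim \mathcal{N}\!\left(
  \mu_X + \Sigma_{XY}\,\Sigma_{YY}^{-1}\,(Y - \mu_Y),\;
  \Sigma_{XX} - \Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}
\right)
```

The observed value of $Y$ appears only in the first argument (the mean), not in the second (the covariance).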

Note that the new mean depends on the value of the conditioned variable, while the new covariance matrix is independent of it.

Now that we have worked through the necessary equations, we will think about how we can understand the two operations visually. While marginalization and conditioning can be applied to multivariate distributions of many dimensions, it makes sense to consider the two-dimensional case as shown in the following figure. Marginalization can be seen as integrating along one of the dimensions of the Gaussian distribution, which is in line with the general definition of the marginal distribution. Conditioning also has a nice geometric interpretation — we can imagine it as making a cut through the multivariate distribution, yielding a new Gaussian distribution with fewer dimensions.
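Both geometric pictures can be checked numerically. The following sketch uses an arbitrary two-dimensional Gaussian with values we chose for illustration. It integrates the joint density along one axis and compares the result with the closed-form marginal, then cuts the density at a fixed value, renormalizes the slice, and compares it with the closed-form conditional.

```python
import numpy as np

# Illustrative joint distribution over (x, y); the numbers are assumptions.
mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

def gauss_pdf_2d(x, y):
    """Density of the joint Gaussian at (x, y), broadcasting over arrays."""
    x, y = np.broadcast_arrays(x, y)
    diff = np.stack([x - mu[0], y - mu[1]], axis=-1)
    quad = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def gauss_pdf_1d(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

grid = np.linspace(-12.0, 12.0, 4001)
dx = grid[1] - grid[0]

# Marginalization: integrate the joint density over y for a fixed x.
x0 = 1.2
marginal_numeric = np.sum(gauss_pdf_2d(x0, grid)) * dx
marginal_closed = gauss_pdf_1d(x0, mu[0], Sigma[0, 0])

# Conditioning: cut the joint density at y = y0 and renormalize the slice.
y0 = 0.3
cut = gauss_pdf_2d(grid, y0)
cut /= np.sum(cut) * dx
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y0 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
cond_closed = gauss_pdf_1d(grid, cond_mean, cond_var)

print(marginal_numeric, marginal_closed)
print(np.max(np.abs(cut - cond_closed)))
```

The marginal only needs the entries of the mean and covariance that belong to the variable we keep, while the conditional shifts the mean and shrinks the variance.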

Now that we have recalled some of the basic properties of multivariate Gaussian distributions, we will combine them together to define Gaussian processes, and show how they can be used to tackle regression problems.

First, we will move from the continuous view to the discrete representation of a function:
rather than finding an implicit function, we are interested in predicting the function values at concrete points, which we call *test points*, in contrast to the observed points, which we refer to as *training data*.

In order to perform regression on the training data, we will treat this problem as *Bayesian inference*.
The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available.
In the case of Gaussian processes, this information is the training data.
Thus, we are interested in the conditional probability

Now that we have the basic framework of Gaussian processes together, there is only one thing missing:
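how do we set up this distribution and define the mean and the covariance matrix? The covariance matrix is determined by a covariance function, which is often also called the *kernel* of the Gaussian process.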

In Gaussian processes we treat each test point as a random variable.
A multivariate Gaussian distribution has the same number of dimensions as the number of random variables.
Since we want to predict the function values at

Recall that in order to set up our distribution, we need to define

The clever step of Gaussian processes is how we set up the covariance matrix: we generate it by evaluating the kernel, which is often also called the *covariance function*, pairwise on all the points.
The kernel receives two points

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.
This step is also depicted in the figure above.
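As a minimal sketch of this pairwise evaluation, the code below builds a covariance matrix with an RBF kernel. The function name `rbf_kernel`, its `length` and `variance` parameters, and the grid of test points are our own choices, and the formula is one common parameterization of the RBF kernel rather than necessarily the exact one behind the figures.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0, variance=1.0):
    """One common form of the RBF kernel, evaluated pairwise.

    a, b: one-dimensional arrays of points.
    Returns a matrix whose entry (i, j) is the kernel value for a[i] and b[j].
    """
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length**2)

# Hypothetical test points on a regular grid.
test_points = np.linspace(-5.0, 5.0, 25)

# Evaluating the kernel on every pair of test points gives the covariance matrix.
K = rbf_kernel(test_points, test_points)

print(K.shape)              # one row and one column per test point
print(K[0, 1], K[0, 10])    # nearby points are more strongly correlated
```

The result is symmetric, has the kernel variance on its diagonal, and its entries decay as the two points move apart.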
In order to get a better intuition for the role of the kernel, let's think about what the entries in the covariance matrix describe.
The entry

Kernels are widely used in machine learning, for example in *support vector machines*

Kernels can be separated into *stationary* and *non-stationary* kernels. *Stationary* kernels, such
as the RBF or the periodic kernel, are functions invariant to translations, and the covariance of two points is only
dependent on their relative position. *Non-stationary* kernels, such as the linear kernel, do not have this
constraint and depend on an absolute location. The stationary nature of the RBF kernel can be observed in the
banding around the diagonal of its covariance matrix (as shown in this figure). Increasing the length parameter increases the banding, as
points further away from each other become more correlated. For the periodic kernel, we have an additional parameter

There are many more kernels that can describe different classes of functions, which can be used to model the desired shape of the function.
A good overview of different kernels is given by Duvenaud
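The difference between stationary and non-stationary kernels can be checked directly: shifting both inputs by the same amount should leave a stationary kernel unchanged. The sketch below uses common textbook forms of the RBF, periodic, and linear kernels. The parameter names (`length`, `period`, `offset`, `variance`, `variance_b`) and the sample inputs are assumptions made for illustration.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """RBF kernel: depends only on the difference a - b."""
    return np.exp(-0.5 * (a - b) ** 2 / length**2)

def periodic(a, b, length=1.0, period=2.0):
    """Periodic kernel: also depends only on the difference a - b."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(a - b) / period) ** 2 / length**2)

def linear(a, b, offset=0.0, variance=1.0, variance_b=0.0):
    """Linear kernel: depends on the absolute positions of a and b."""
    return variance_b + variance * (a - offset) * (b - offset)

a, b, shift = 0.3, 1.1, 4.7

for name, kernel in [("rbf", rbf), ("periodic", periodic), ("linear", linear)]:
    original = kernel(a, b)
    translated = kernel(a + shift, b + shift)
    print(f"{name:>8}: k(a, b) = {original:.4f}, shifted = {translated:.4f}")
```

Only the linear kernel changes its value under the shift, which is why its covariance matrix shows no banding around the diagonal.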

We will now shift our focus back to the original task of regression.
As we have mentioned earlier, Gaussian processes define a probability distribution over possible functions.
Because this distribution is a multivariate Gaussian distribution, the distribution of functions is normal.
Recall that we usually assume a mean of zero. The distribution before any training data has been observed is called the *prior* distribution

If we have not yet observed any training examples, this distribution revolves around

In the previous section we have looked at examples of different kernels. Since the kernel is used to define the entries of the covariance matrix, it also determines which type of functions from the space of all possible functions are more probable. As the prior distribution does not yet contain any additional information, it is perfect to visualize the influence of the kernel on the distribution of functions. The following figure shows samples of potential functions from prior distributions that were created using different kernels:
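Sampling from the prior can be sketched in a few lines. We assume a zero mean and an RBF kernel in one common parameterization. The helper names `rbf_kernel`, `sample_prior`, and `roughness` are ours, and the small value added to the diagonal is a numerical safeguard for the Cholesky factorization, not part of the model.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0, variance=1.0):
    """RBF kernel evaluated pairwise on two one-dimensional arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length**2)

def sample_prior(test_points, length, noise):
    """Turn standard normal noise into zero-mean samples with covariance K."""
    n = len(test_points)
    K = rbf_kernel(test_points, test_points, length=length)
    # Small diagonal term keeps the factorization numerically stable.
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n))
    return (L @ noise).T  # one sampled function per row

def roughness(samples):
    """Average absolute change between neighbouring function values."""
    return np.mean(np.abs(np.diff(samples, axis=1)))

rng = np.random.default_rng(42)
test_points = np.linspace(-5.0, 5.0, 50)
noise = rng.standard_normal((len(test_points), 5))

# The same noise, shaped by two different length parameters.
wiggly = sample_prior(test_points, length=0.5, noise=noise)
smooth = sample_prior(test_points, length=2.0, noise=noise)

print(roughness(wiggly), roughness(smooth))
```

Each row is one function drawn from the prior, evaluated at the test points. A larger length parameter correlates distant points more strongly and therefore produces smoother samples.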

Adjusting the parameters allows you to control the shape of the resulting functions.
This also varies the confidence of the prediction.
When decreasing the variance *Linear* kernel, setting the variance

So what happens if we observe training data?
Let's get back to the model of Bayesian inference, which states that we can incorporate this additional information into our model, yielding the *posterior* distribution

First, we form the joint distribution

For the next step we need one operation on Gaussian distributions that we have defined earlier.
Using *conditioning* we can find

Analogous to the prior distribution, we could obtain a prediction by sampling from this distribution. But, since sampling involves randomness, we would have no guarantee that the result is a good fit to the data. In order to make a better prediction we can use the other basic operation of Gaussian distributions.

Through *marginalization* of each random variable, we can extract the respective mean function value

The following figure shows an example of the conditional distribution.
At first, no training points have been observed.
Accordingly, the mean prediction remains at

The training points can be activated by clicking on them, which leads to a constrained distribution. This change is reflected in the entries of the covariance matrix, and leads to an adjustment of the mean and the standard deviation of the predicted function. As we would expect, the uncertainty of the prediction is small in regions close to the training data and grows as we move further away from those points.
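The whole regression step fits into a short sketch. It assumes a zero prior mean, noise-free observations, and an RBF kernel in one common parameterization. The training data and all variable names are made up for illustration, and the tiny diagonal term only stabilizes the linear solve.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0, variance=1.0):
    """RBF kernel evaluated pairwise on two one-dimensional arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length**2)

# Hypothetical training data and test points.
train_x = np.array([-3.0, -1.0, 2.0])
train_y = np.array([0.5, -0.7, 1.2])
test_x = np.linspace(-5.0, 5.0, 101)

# Blocks of the joint covariance matrix over test and training points.
K_train = rbf_kernel(train_x, train_x) + 1e-10 * np.eye(len(train_x))
K_cross = rbf_kernel(test_x, train_x)
K_test = rbf_kernel(test_x, test_x)

# Conditioning the joint Gaussian on the observed training values.
solved = np.linalg.solve(K_train, K_cross.T)
post_mean = solved.T @ train_y
post_cov = K_test - K_cross @ solved

# Marginal standard deviation of each test point (clipped against round-off).
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

nearest = [int(np.argmin(np.abs(test_x - x))) for x in train_x]
print(post_mean[nearest])   # close to train_y
print(post_std[nearest])    # close to zero
print(post_std[-1])         # far from the data: close to the prior value
```

At the training points the mean reproduces the observed values and the standard deviation collapses, while far away from the data the prediction falls back towards the prior.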

In the constrained covariance matrix, we can see that the correlation of neighbouring points is affected by the training data. If a predicted point lies on the training data, there is no correlation with other points. Therefore, the function must pass directly through it. Predicted values further away are also affected by the training data — proportional to their distance.

As described earlier, the power of Gaussian processes lies in the choice of the kernel function. This property allows an expert to introduce domain knowledge into the process and lends Gaussian processes their flexibility to capture trends in the training data. For example, by choosing a suitable bandwidth for the RBF kernel, an expert can control how smooth the resulting function will be.

A big benefit that kernels provide is that they can be combined together, resulting in a more specialized kernel.
This gives domain experts the ability to include further information, leading to a more accurate prediction.
The usual ways to combine kernels are to add them or to multiply them with each other.
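Sums and products of valid kernels are again valid kernels, so combining them comes down to element-wise arithmetic on the covariance matrices. The sketch below uses common textbook forms of the linear, periodic, and RBF kernels with parameter names and values we chose ourselves, and checks that the combined matrices remain symmetric and positive semi-definite up to round-off.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0):
    """RBF kernel evaluated pairwise."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def periodic_kernel(a, b, length=1.0, period=2.0):
    """Periodic kernel evaluated pairwise."""
    dist = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / length**2)

def linear_kernel(a, b, offset=0.0):
    """Linear kernel evaluated pairwise."""
    return (a[:, None] - offset) * (b[None, :] - offset)

x = np.linspace(0.0, 10.0, 40)

# Addition: a linear trend plus a periodic deviation around it.
K_sum = linear_kernel(x, x) + periodic_kernel(x, x)

# Multiplication: a periodic pattern whose correlation fades with distance.
K_prod = rbf_kernel(x, x, length=3.0) * periodic_kernel(x, x)

for name, K in [("sum", K_sum), ("product", K_prod)]:
    print(name, np.allclose(K, K.T), np.linalg.eigvalsh(K).min())
```

Either matrix can be used in place of a single-kernel covariance matrix when sampling from the prior or computing the posterior.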
Let's consider two kernels, an RBF kernel

In the figure below, the original training data has an ascending trend with a periodic deviation.
Using only a linear kernel, it is possible to achieve a normal linear regression of the points.
At first sight the RBF kernel accurately approximates the points.
But since the RBF kernel is stationary it will always return to

With this article, you should have obtained an overview of Gaussian processes and developed a deeper
understanding of how they work.
As we have seen, Gaussian processes offer a flexible framework for regression and several extensions exist that
make them even more versatile.
If we need to deal with real-world data, we will often find measurements that are afflicted with uncertainty and
errors. Using Gaussian processes, we can define a kernel function that fits our data and add uncertainty to the
prediction.
For example, one particular extension to Gaussian processes from McHutchon et al.

Even though we mostly talk about Gaussian processes in the context of regression, they can be adapted for
different purposes, e.g. *model-peeling* and hypothesis testing.
By comparing different kernels on the dataset, domain experts can introduce additional knowledge through
appropriate combination and parameterization of the kernel.
As this might not be possible in many cases, learning specialized kernel functions from the underlying data using
deep learning

If we have sparked your interest, we have compiled a list of further blog posts on the topic of Gaussian processes. In addition, we have linked two Python notebooks that will give you some hands-on experience and help you to get started right away.

We are very grateful to Carla Avolio and Marc Spicker for their feedback on the manuscript. In addition, we want to thank Jonas Körner for helping with the implementation of the figure explaining the multivariate Gaussian distribution. Furthermore, we would like to thank the German Research Foundation (DFG) for financial support within project A01 of the SFB-TRR 161 and within the Research Unit FOR 2111 with grant number DFG-431/16.

The following blog posts offer more interactive visualizations and further reading material on the topic of Gaussian processes:

- Gaussian process regression demo by Tomi Peltola
- Gaussian Processes for Dummies by Katherine Bailey
- Intuition behind Gaussian Processes by Mike McCourt
- Fitting Gaussian Process Models in Python by Chris Fonnesbeck

If you want more of a hands-on experience, there are also many Python notebooks available:

- Fitting Gaussian Process Models in Python by Chris Fonnesbeck
- Gaussian process lecture by Andreas Damianou