Intro to Gaussian Processes (GPs)
What is a Gaussian Process model?
From Richard McElreath’s YouTube presentation.
Give me some maths, please:
I retrieved the maths and descriptions from Trent Henderson’s blog post Interpretable time-series modelling using Gaussian processes, Herb Susmann’s notebook Derivatives of a Gaussian Process, and Brendan Hasz’s blog post.
Univariate Gaussian regression:
As you may recall, the univariate Gaussian (normal) distribution is defined by (i) a mean \(\mu\) and (ii) a variance \(\sigma^2\): \[ \mathcal{N}(\mu, \sigma^2) \]
Multivariate Gaussian regression
We can extend this univariate case to the multivariate Gaussian by increasing the dimensionality of the parameters. The mean becomes a mean vector \(\boldsymbol{\mu}\) and the variance becomes a covariance matrix \(\boldsymbol{\Sigma}\) (as we now need to describe both the variance of each variable and the covariance between them), written as:
\[ \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]
For example, we might jointly model plankton cell size, chlorophyll content, and plankton growth rate across lakes.
\[ \mathbf{X} = \begin{bmatrix} \text{Cell size} \\ \text{Chl content} \\ \text{Growth rate} \end{bmatrix} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]
\[ \boldsymbol{\mu} = \begin{bmatrix} 15 \\ 2.5 \\ 0.8 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} 4 & 0.6 & -0.1 \\ 0.6 & 0.2 & 0.01 \\ -0.1 & 0.01 & 0.01 \end{bmatrix} \]
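To check that this example behaves as advertised, here is a minimal base-R sketch (mvtnorm::rmvnorm() would do the same job; note I use a Chl/growth-rate covariance of 0.01 so that \(\boldsymbol{\Sigma}\) is positive definite, which any valid covariance matrix must be):

```r
# Sample from the example multivariate Gaussian using only base R.
# (The Chl/growth-rate covariance is 0.01 here to keep Sigma positive definite.)
mu <- c(15, 2.5, 0.8)
Sigma <- matrix(c( 4.00, 0.60, -0.10,
                   0.60, 0.20,  0.01,
                  -0.10, 0.01,  0.01), nrow = 3, byrow = TRUE)

set.seed(42)
z <- matrix(rnorm(3 * 5000), ncol = 3)          # independent standard normals
samples <- sweep(z %*% chol(Sigma), 2, mu, "+") # x = mu + R'z, where R'R = Sigma

round(colMeans(samples), 2)  # should be close to mu
round(cov(samples), 2)       # should be close to Sigma
```

With a few thousand draws, the sample mean and covariance should recover \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) to about two decimals.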
Gaussian process regression:
A GP regression is just a generalisation of this idea to an infinite number of dimensions. This means our mean vector and covariance matrix become a mean function \(m(x)\) and a covariance function \(k(x, x')\) (also called a ‘kernel’). The covariance function controls the smoothness of realizations from the Gaussian process and the degree of shrinkage towards the mean. Putting this all together, we have the general form of a GP:
\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]
A popular choice for the covariance kernel function \(k(x,x')\) is the squared exponential kernel (a.k.a the Gaussian kernel or the radial basis function kernel).
For two data points which have \(x\) values \(x_i\) and \(x_j\), the squared exponential kernel value between them is: \[k(x_i, x_j) = \alpha^2 \exp\left(-\frac{(x_i - x_j)^2}{2\ell^2}\right)\]
where \(\alpha^2\) is the variance and \(\ell\) is a length-scale parameter which controls how quickly the association between points decays as they become farther apart. Learning one length scale per predictor variable is sometimes referred to as ‘automatic relevance determination’ (ARD), because ‘irrelevant’ predictor variables can be assigned a large length scale (equivalently, a small inverse length scale), reducing their influence. Note that the Empirical Dynamic Modeling (EDM) literature (Munch et al. 2017) and Tanya Rogers’s vignettes work with the inverse length scale, called \(\phi\) (phi): a large \(\phi\) corresponds to a short length scale \(\ell\).
Question:
But why should we use the GP regression within the EDM framework?
Let us say we have an ecosystem consisting of \(M\) different “state variables.” The state variables could represent the population density of \(M\) different species, or the concentrations of \(M\) different nutrients. They could also represent the density of just one species in \(M\) different locations. More likely, the state variables represent some combination of population densities, nutrients, and abiotic factors in several different locations.
If we use \(x_{1,t}, x_{2,t}, \dots, x_{M,t}\) to represent the value of all state variables at time \(t\), then the vector
\[ \mathbf{x}_t = \{ x_{1,t}, x_{2,t}, \dots, x_{M,t} \} \]
represents the state of the system. As the system changes through time, the result is a trajectory through this state space.
Apart from transients and random perturbations, the trajectory for most systems of interest will converge to an attractor, e.g., a point, a closed loop, or a more complex shape. We can describe the dynamics of the system in discrete time by a set of coupled equations. Since there are \(M\) state variables, there are \(M\) different “functions,” i.e., discrete-time models, each of which is a function of the current system state.
That is,
\[ x_{1,t+1} = F_1(\mathbf{x}_t), \quad x_{2,t+1} = F_2(\mathbf{x}_t), \quad \dots \]
If we have data on all of the state variables over a wide enough range of values, we can empirically estimate the functions \(F_1(\mathbf{x}_t), F_2(\mathbf{x}_t), ...\) from the data.
And that is where the GP regression comes in:
The \(F_i(\mathbf{x}_t)\) function can be estimated using any number of flexible, non-parametric regression approaches such as the piecewise constant “Simplex” (Sugihara and May 1990), local weighted linear regression “S-Map” (Sugihara 1994), splines (Ellner and Turchin 1995) or the Gaussian Process regression (Munch et al. 2017).
Full GP-EDM model specification:
Let \(x_t\) represent the species abundance at time \(t\), and consider it a random function of lagged values of itself (i.e., \(x_{t-1}, \ldots, x_{t-m}\)), where \(m\) is the maximum number of time lags used.
That is, \(x_t = f(x_{t-1}, \ldots, x_{t-m}) + \varepsilon_t\), where \(f\) is a random function that approximates the dynamical process and \(\varepsilon_t\) is the process error.
Then, we employ the GP regression prior to approximate the delay-embedding map \(f\) (Munch et al. 2017). The inputs, gathered in a vector \(X_{t-m} = \{x_{t-1}, \ldots, x_{t-m}\}\), are incorporated into a Bayesian GP model formulated as follows (retrieved from Tsai et al. 2024):
\[ P(x_t \mid f, X_{t-m}, \phi, \tau, V_e) \sim \mathcal{N}\big(f(X_{t-m}), V_e\big), \]
\[ P(f \mid \phi, \tau, V_e) \sim \mathcal{GP}(0, k(\cdot, \cdot))\quad \textcolor{red}{\text{(GP prior)}} \]
\[ P(\phi) \sim \text{HalfNormal}\!\left(\pi \sqrt{\tfrac{1}{2}}\right), \]
\[ P(\tau) \sim \text{Beta}(1.1, 1.1), \]
\[ P(V_e) \sim \text{Beta}(1.1, 1.1). \tag{1} \]
Here, the first layer represents the probability density of observing the species abundance at time \(t\) given the function approximation \(f\), the input data of time lags \(X_{t-m}\), and the parameters \(\phi\), \(\tau\), and \(V_e\), where \(V_e\) represents the process noise. The second layer represents the probability density of the function approximation \(f\) given the parameters. The unknown function \(f\) is assigned a Gaussian process prior, which generalizes the multivariate normal distribution, with mean zero and covariance function \(k(\cdot, \cdot)\).
The covariance function is a tensor product of squared exponential covariances for each input with pointwise variance \(\tau^2\) and inverse length scales \(\phi = \{\phi_1, \ldots, \phi_m\}\) where the index \(i\) is from \(1\) to \(m\) (the maximum number of time lags). Specifically, the covariance function is
\[ k(X_t, X_s) = \tau^2 \prod_{i=1}^m \exp\big[-\phi_i (x_{t-i} - x_{s-i})^2\big], \]
where \(X_t\) and \(X_s\) are delay coordinate vectors for years \(t\) and \(s\). The final layer of the Bayesian model specifies priors for the hyperparameters \(\{\phi_1, \ldots, \phi_m, \tau, V_e\}\).
We encourage sparsity in the fitted model by assigning a half-normal prior with variance \(\pi \sqrt{\tfrac{1}{2}}\) to each \(\phi_i\). This prior choice asserts that \(f\) has, on average, one local extreme value within the data range, and the mode at zero encourages uninformative lags to drop out of the model (i.e. \(\phi_i \to 0\)). This prior is widely used in GP regression under the name automatic relevance determination (ARD). To further encourage sparsity, \(\phi_i\) is treated as zero when it rounds to zero at the second decimal place. To improve the interpretability of the inverse length-scale parameters \(\phi\), time series are centered and standardized to unit variance prior to analysis.
Note that under this prior specification, the variance of a prediction at any input is \(V_e + \tau^2\). Since the data are standardized to unit variance, \(V_e + \tau^2 \approx 1\) should be sufficient. Nearly flat, independent Beta distributions are therefore used as priors for \(V_e\) and \(\tau^2\), allowing up to twice the variance in the data.
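To make the tensor-product covariance function above concrete, here is a minimal base-R sketch (helper name is mine, not the GPEDM implementation); it also shows that the product of per-lag squared-exponential terms equals a single exponential of the \(\phi\)-weighted squared distance:

```r
# Tensor-product squared-exponential covariance with ARD inverse length scales
# (illustrative sketch, not the GPEDM package code)
k_ard <- function(Xt, Xs, tau, phi) {
  # Xt, Xs: delay-coordinate vectors (x_{t-1}, ..., x_{t-m})
  # tau^2 : pointwise variance; phi: one inverse length scale per lag
  tau^2 * prod(exp(-phi * (Xt - Xs)^2))
}

Xt  <- c(0.8, 0.5, 0.1)
Xs  <- c(0.6, 0.4, 0.3)
phi <- c(1.0, 0.5, 0.0)  # phi_3 = 0: lag 3 drops out of the covariance

k_ard(Xt, Xs, tau = 1, phi = phi)
# identical to the single-exponential form:
exp(-sum(phi * (Xt - Xs)^2))
```

Setting a \(\phi_i\) to zero makes its factor equal 1, so that lag contributes nothing to the covariance; this is the mechanism behind the ARD sparsity described above.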
Simulate GP regression in R
No more maths and words… now it is time to play and have fun! Let’s code!
Example 1
Let’s write this kernel function in R, and use it to draw samples from a GP. For simplicity, we will fix the mean function \(m(x) = 0\) for all the GPs we work with in this exercise. First, define the kernel function:
Code
# squared exponential kernel
kernel_function <- function(x_i, x_j, alpha, l) {
alpha^2 * exp(-(x_i - x_j)^2 / (2 * l^2))
}
Let’s write a helper function that, given a kernel and a set of points, generates a full covariance matrix:
Code
covariance_from_kernel <- function(x, kernel, ...) {
outer(x, x, kernel, ...)
}
and a function to draw from a mean-zero GP with a given covariance kernel:
Code
gp_draw <- function(draws, x, Sigma, ...) {
# Create a mean vector 'mu' of all zeros, one for each input location in x
mu <- rep(0, length(x))
# Use mvtnorm::rmvnorm() to draw random function samples from a multivariate normal dist. with mean mu and covariance Sigma.
mvtnorm::rmvnorm(draws, mu, Sigma)
}
Now we can draw 10 samples from a mean-zero GP with a squared-exponential kernel with parameters \(\alpha = 1, \ell = 1\):
Code
set.seed(1) # reproducibility
n <- 100 # number of points to draw
x <- seq(1, 10, length.out = n) # position of each point
# Kernel parameters
alpha <- 1 # scale parameter
l <- 1 # length-scale parameter (the EDM literature works with the inverse length scale, phi)
# generate the covariance matrix
Sigma <- covariance_from_kernel(x, kernel_function, alpha = alpha, l = l) # remember that smoothness comes from the squared exponential kernel function we defined previously
# Draw 10 random functions
draws <- gp_draw(10, x, Sigma)
# Convert to long dataframe for ggplot
df <- as.data.frame(t(draws))
df1 <- cbind(x, df)
df2 <- df1 |> pivot_longer(cols = -x, # gather all draw columns
names_to = "functions",
values_to = "y")
# plot:
ggplot( data = df2 , aes( x = x , y = y , color = functions))+
geom_line()+
ylab("y = f(x)")+
theme_bw(base_size = 14)
If we want to visualize the covariance matrix, here is a function to do that, and the plot for all hyperparameters set to 1:
Code
# generate the covariance matrix with hyperparameters = 1
Sigma <- covariance_from_kernel(x, kernel_function, alpha = 1, l = 1)
# generate the covariance matrix with length-scale hyperparameter = 0.2
Sigma2 <- covariance_from_kernel(x, kernel_function, alpha = 1, l = 0.2)
Code
#' Code retrieved from:
#' https://hendersontrent.github.io/posts/2024/05/gaussian-process-time-series/
#' Convert covariance matrix to dataframe and plot
#' @param matrix the covariance matrix
#' @return object of class \code{ggplot}
#'
plot_covariance <- function(matrix){
mat_df <- as.data.frame(matrix)
mat_df$x <- seq_len(nrow(mat_df))
mat_df <- reshape(mat_df,
direction = "long",
v.names = "values",
varying = 1:(ncol(mat_df) - 1),
times = 1:nrow(mat_df),
idvar = "x")
p <- ggplot(data = mat_df) +
geom_tile(aes(x = x, y = time, fill = values)) +
labs(title = "Heatmap of covariance matrix",
x = "x",
y = "x",
fill = "k(x,x')") +
scale_fill_viridis_c(limits = c(0, 1),
breaks = seq(from = 0, to = 1, by = 0.2)) +
theme_bw() +
theme(legend.position = "bottom")
return(p)
}
Covariance matrix with \(\ell = 1\)
Code
plot_covariance(matrix = Sigma)
Covariance matrix with \(\ell = 0.2\)
Code
plot_covariance(matrix = Sigma2)
Example 2
Code
# Simulate artificial ecological data: Rabbits (R) and Foxes (F)
# Let's assume fox population depends nonlinearly on rabbit population
set.seed(123)
n <- 100
R <- seq(1, 10, length.out = n) # Rabbit population (independent variable)
F <- 5 * sin(0.3 * R) + sqrt(R) + rnorm(n, 0, 0.3) # Fox population (response), with noise
data <- data.frame(R = R, F = F)
Code
# Fit a Gaussian Process regression model
# Goal: Learn a smooth (possibly nonlinear) function: foxes (F) as a function of rabbits (R)
# This model assumes F is generated by a GP: F ~ GP(R)
# 'rbfdot' kernel allows modeling smooth, nonlinear relationships
gp_model <- gausspr(R, F,
data = data,
kernel = "rbfdot", # Radial Basis Function (Gaussian) kernel
var = 0.01) # Small noise term to allow slight deviation (smoothing)
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Code
# Predict on a fine grid across rabbit populations
# This lets us see the model behavior across the full range of R
# We want smooth predictions (and derivatives) at many points
R_grid <- seq(min(R), max(R), length.out = 100)
preds <- predict(gp_model, data.frame(R = R_grid), type = "response")
To estimate the local interaction strength between species (e.g., how changes in rabbit population \(R\) affect fox population \(F\)), we want to compute the partial derivative:
\(\frac{\partial F}{\partial R}\)
This derivative corresponds to a Jacobian element of the system at a given state \(R\).
Since the Gaussian Process (GP) model provides a smooth function \(F = f(R)\), we can use a numerical gradient to estimate this derivative at each point in a grid of rabbit values.
We first create a wrapper function around the GP prediction model:
Code
# To estimate the derivative ∂F/∂R at each point:
# We must pass a *function* to grad() from numDeriv.
# So we create a wrapper around the GP prediction function.
gp_fun <- function(x) {
predict(gp_model, data.frame(R = x), type = "response")
}
# Estimate the Jacobian (local sensitivity ∂F/∂R) using grad()
# grad() computes the numerical derivative of gp_fun at each R_grid point. In other words we apply grad() to each value of R_grid to compute the Jacobian (∂F/∂R)
# This gives us a time- or state-specific interaction strength
jacobian_estimates <- sapply(R_grid, function(r) grad(gp_fun, x = r))
head(jacobian_estimates)
[1] 2.927492 2.944788 2.592845 2.041876 1.491004 1.111472
Then we can visualize it:
Code
# Visualization: GP fit and estimated interaction strength
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
# Top panel: Observed vs. predicted fox population
plot(R, F, pch = 16, col = "gray", main = "GP Fit: Foxes (F) vs Rabbits (R)",
ylab = "Foxes (F)", xlab = "Rabbits (R)")
lines(R_grid, preds, col = "blue", lwd = 2)
# Bottom panel: Estimated Jacobian (local interaction strength ∂F/∂R)
# Positive values: more rabbits → more foxes
# Negative values: more rabbits → fewer foxes (e.g. saturation)
plot(R_grid, jacobian_estimates, type = "l", col = "red", lwd = 2,
main = "Estimated Local Interaction Strength (∂F/∂R)",
ylab = "Gradient ∂F/∂R", xlab = "Rabbits (R)")
abline(h = 0, lty = 2)
Examples from the web
Read Dr. Matthijs Hollanders’s blog post: Why you should default to Gaussian processes for random spatial and temporal effects?
Motorcycle accident acceleration data (link)
Birthdays (link)
Pros and Cons of GP-EDM
Benefits
we can incorporate auxiliary biological and physical information through prior specification
we can incorporate a multilevel / hierarchical structure to share information across different locations (e.g. lakes).
Cons
GP-EDM becomes noticeably slow with more than several hundred data points, and extremely slow with more than 1000.
The GPEDM package
I am going to reproduce Tanya Rogers’s vignettes (link here).
Mathematical foundation
The GP model fit by GPEDM is as described in (Munch et al. 2017). This model uses a mean function of 0 and a squared exponential covariance function with one inverse length scale hyperparameter (phi, \(\phi_i\)) for each input \(i\) (predictor). The values of \(\phi_i\) tell you about the wiggliness of the function with respect to each input: a value of 0 indicates no relationship (the function is flat), low values of \(\phi_i\) indicate a mostly linear relationship (the function is very stiff), and large values of \(\phi_i\) indicate more nonlinear relationships (the function is more wiggly). The model, which is Bayesian, is as follows:
\[ \begin{aligned} p[y_{t}|f,\mathbf{x}_{t},\theta]&\sim N(f(\mathbf{x}_t),V_\epsilon) \\ p[f|\theta]&\sim GP(0,\Sigma) \\ p[\theta] \end{aligned} \] Where theta, (\(\theta\)) is the collection of hyperparameters \(\{\phi_1...\phi_E, V_\epsilon, \sigma^2\}\), and the covariance function is defined by:
\[ \Sigma(\mathbf{x}_t,\mathbf{x}_s)=\sigma^2 {\exp[-\sum_{i=1}^{E}\phi_i(x_{it}-x_{is})^2]} \]
Therefore, for a given training data set, we want to find the optimal fit by estimating these three hyperparameters:
\(\phi\): inverse length-scale hyperparameters
\(V_\epsilon\): process variance
\(\sigma^2\): pointwise prior variance
Two ways to fit a GP-EDM model:
E/tau Method: Supply a time series \(y\) and values for \(\tau\) (tau, time delay) and \(E\) (number of lags to consider). The model fitting function will construct lags internally (\(E\) lags of \(y\) with spacing \(\tau\)) and use those as predictors. This method is convenient, but has some limitations. First, it assumes a fixed \(\tau\), so you can’t use unevenly spaced lags. Second, there are limited ways in which covariates can be incorporated.
Custom Lags Method: Construct the lags yourself externally, and supply the exact response variable and the exact collection of predictor variables to use. The model fitting function will use those directly. This allows for the most flexibility and customization, in that you can use any response variable and any collection of lag predictors from any number of variables.
\(\tau\) is the number of timesteps into the past that we are using for prediction. Considerations in choosing \(\tau\) include:
Forecasting needs - over what time horizons are we trying to forecast? For example, if we have 10-minute data, probably a tau of 1 is not very useful because it would just be using data 10 minutes ago to predict data 10 minutes into the future. However, if we have daily data and we want a forecast for tomorrow, using a tau of 60 days might not be very useful because the model would only be using data that occurred two months ago to make a forecast of tomorrow (looking at more recent data would be beneficial).
Autocorrelation in the data - if we want to avoid relying purely on autocorrelation for our predictive capacity, we might choose a larger value of tau (\(\tau\)). One way to assess this is to visualize the ACF of the target variable and choose a tau near the first minimum (or where the ACF approaches 0 or levels off). Model fit should not be overly sensitive to tau: changing it from, e.g., 88 to 90 to 92 days should not make a huge difference.
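One way to implement this ACF heuristic in base R (a sketch; the function name and the 0.2 threshold are my own choices, not a rule from the GPEDM literature):

```r
# Choose tau as the first lag at which the ACF drops below a small threshold
# (illustrative heuristic; name and threshold are mine)
choose_tau <- function(y, threshold = 0.2, max_lag = 30) {
  rho <- acf(y, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  tau <- which(abs(rho) < threshold)[1]
  if (is.na(tau)) max_lag else tau
}

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 500))  # toy autocorrelated series
choose_tau(y)
# for an AR(1) with coefficient 0.7, the theoretical ACF 0.7^k
# first drops below 0.2 at lag 5, so the estimate should be nearby
```

For white noise the heuristic returns lag 1, consistent with there being no autocorrelation to avoid.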
To get a sense of what we have read, let’s have fun with some simulated data !
Fake data: 3-species chaotic Hastings-Powell model
The Hastings-Powell model is a mechanistic model describing a three-species food chain, consisting of a prey, a middle predator, and a top predator (Hastings & Powell 1991). It is known to exhibit chaotic dynamics under certain parameter values.
Let:
- \(x\) be the prey population,
- \(y\) be the middle predator population,
- \(z\) be the top predator population.
The system of differential equations is given by:
\[ \begin{aligned} \frac{dx}{dt} &= x(1 - x) - \frac{a_1 x y}{1 + b_1 x} \\\\ \frac{dy}{dt} &= \frac{a_1 x y}{1 + b_1 x} - d_1 y - \frac{a_2 y z}{1 + b_2 y} \\\\ \frac{dz}{dt} &= \frac{a_2 y z}{1 + b_2 y} - d_2 z \end{aligned} \]
Where:
- \(a_1, a_2\) are the predation rates,
- \(b_1, b_2\) are the handling time coefficients,
- \(d_1, d_2\) are the natural death rates of the middle and top predators, respectively.
This model incorporates Holling Type II functional responses and can exhibit periodic or chaotic dynamics depending on the choice of parameters.
Code
# chaotic 3-species Hastings-Powell model
dplyr::glimpse(HastPow3sp)
Rows: 501
Columns: 4
$ Time <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ X <dbl> 0.8000000, 0.8293102, 0.8399347, 0.8398126, 0.8332662, 0.8223340,…
$ Y <dbl> 0.1000000, 0.1036896, 0.1090985, 0.1159568, 0.1242236, 0.1339824,…
$ Z <dbl> 9.000000, 8.986057, 8.974955, 8.967575, 8.964660, 8.966918, 8.975…
Code
n <- nrow(HastPow3sp)
set.seed(1)
# We will add a bit of observation noise to the data.
HastPow3sp$X <- HastPow3sp$X + rnorm(n, 0, 0.05)
HastPow3sp$Y <- HastPow3sp$Y + rnorm(n, 0, 0.05)
HastPow3sp$Z <- HastPow3sp$Z + rnorm(n, 0, 0.05)
par(mfrow = c(3, 1), mar = c(4, 4, 1, 1))
plot(X ~ Time, data = HastPow3sp, type = "l")
plot(Y ~ Time, data = HastPow3sp, type = "l")
plot(Z ~ Time, data = HastPow3sp, type = "l")
E/tau Method: fitting GPEDM model for a single time series
In this example, we split the original dataset (without lags) into a training and testing set, and fit a model with some arbitrary embedding parameters using the E/tau Method. For expediency we will put the test data in fitGP as well to generate predictions with one function call. You could omit it though.
Code
#E/tau Method with training/test split
HPtrain <- HastPow3sp[1:200,]
HPtest <- HastPow3sp[201:300,]
Code
gp_out1 <- fitGP(data = HPtrain,
time = "Time",
y = "X",
E = 3, # number of lags to be used: 3
tau = 2, # lags spaced by 2 time steps, so we predict X_t from X_{t-2}, X_{t-4}, and X_{t-6}
newdata = HPtest)
The output of fitGP is a rather involved list object containing everything needed to make predictions from the model (much of which is for bookkeeping), but there are a few important outputs you may want to look at and extract. See help(fitGP) for more detail about the outputs.
The summary function quickly gives you the fitted values of the inverse length scales phi for each predictor, the variance terms, the \(R^2\) value for the training data (In-sample R-squared), and the \(R^2\) value for the test data (Out-of-sample R-squared) if test data were provided.
Code
summary(gp_out1)
Number of predictors: 3
Length scale parameters:
predictor posteriormode
phi1 X_2 0.12878
phi2 X_4 0.00832
phi3 X_6 0.06941
Process variance (ve): 0.1422683
Pointwise prior variance (sigma2): 2.648974
Number of populations: 1
In-sample R-squared: 0.8702953
Out-of-sample R-squared: 0.7764384
Then we plot. Note that by default, plot will plot the out-of-sample predictions (for the test data, or the leave-one-out predictions if requested) if available.
Code
plot(gp_out1)
With ggplot:
Code
ggplot(gp_out1$insampresults, aes(x = timestep,y = predmean)) +
geom_line() +
geom_ribbon(aes(ymin = predmean - predfsd, ymax = predmean + predfsd)) +
geom_point(aes(y = obs), shape = 21) +
theme_bw()To see the in-sample fit, just do that
Code
head(gp_out1$insampresults, 10)
timestep pop predmean predfsd predsd obs
1 0 1 NA NA NA 0.7686773
2 1 1 NA NA NA 0.8384924
3 2 1 NA NA NA 0.7981533
4 3 1 NA NA NA 0.9195766
5 4 1 NA NA NA 0.8497416
6 5 1 NA NA NA 0.7813106
7 6 1 0.8386267 0.010150692 0.06686471 0.8321681
8 7 1 0.8032712 0.013345600 0.06742372 0.8266759
9 8 1 0.8403718 0.009173628 0.06672338 0.7967478
10 9 1 0.8270748 0.014528528 0.06766780 0.7266621
You may note that the first E*tau = 6 values are missing, which will happen when using the E/tau Method, since data prior to the start of the dataset are unknown. This can be prevented using the Custom Lags Method. That approach also allows you to change the forecast horizon by changing the lags used. For instance, if you wanted to use a forecast horizon of 3 with a tau of 2, you would change the lags to 3, 5, and 7.
If you want to make predictions for another test data set, say the next 100 values, you can use predict rather than refit the model. The function plot will work on this output as well. The function summary will not, but you can access the fit statistics and predictions directly.
Custom Lags Method: fitting GPEDM model for a single time series
The GPEDM function for making lags is makelags, which makes lags \(y_{t-\tau}, y_{t-2\tau}, \ldots, y_{t-E\tau}\). The lag is indicated after the underscore. If append=TRUE, it will return the data with the lags appended, otherwise just the lags.
Code
Time X Y Z X_2 X_4 X_6
1 0 0.7686773 0.08515657 8.956461 NA NA NA
2 1 0.8384924 0.04452754 8.996594 NA NA NA
3 2 0.7981533 0.10966309 8.978425 0.7686773 NA NA
4 3 0.9195766 0.16553683 8.884442 0.8384924 NA NA
5 4 0.8497416 0.20392200 9.005202 0.7981533 0.7686773 NA
6 5 0.7813106 0.06534688 8.871301 0.9195766 0.8384924 NA
7 6 0.8321681 0.13290436 8.912729 0.8497416 0.7981533 0.7686773
8 7 0.8266759 0.21658295 9.039756 0.7813106 0.9195766 0.8384924
9 8 0.7967478 0.11811768 8.984980 0.8321681 0.8497416 0.7981533
10 9 0.7266621 0.06469385 9.031512 0.8266759 0.7813106 0.9195766
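For intuition, the same lag structure can be reproduced in base R (a sketch of what makelags constructs; the function name is mine, not the package’s):

```r
# Build lags y_{t-tau}, y_{t-2*tau}, ..., y_{t-E*tau} in base R
# (illustrative equivalent of what GPEDM's makelags returns; names are mine)
make_lags_base <- function(y, E, tau) {
  n <- length(y)
  out <- sapply(1:E, function(i) c(rep(NA, i * tau), y)[1:n])
  colnames(out) <- paste0("y_", (1:E) * tau)
  out
}

y <- c(0.77, 0.84, 0.80, 0.92, 0.85, 0.78, 0.83)  # toy series
make_lags_base(y, E = 3, tau = 2)
# y_2 is y shifted down 2 steps, y_4 by 4, y_6 by 6 (leading NAs),
# matching the pattern of the X_2, X_4, X_6 columns in the table above
```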
Then we split the data set:
Code
# Split data set
HPtrain_lags <- HastPow3sp_lags[1:200,]
HPtest_lags <- HastPow3sp_lags[201:300,]
Then we fit the GPEDM model:
Code
Number of predictors: 3
Length scale parameters:
predictor posteriormode
phi1 X_2 0.12878
phi2 X_4 0.00832
phi3 X_6 0.06941
Process variance (ve): 0.1422683
Pointwise prior variance (sigma2): 2.648974
Number of populations: 1
In-sample R-squared: 0.8702953
Out-of-sample R-squared: 0.8018431
Code
plot(gp_out2a)
With ggplot:
Code
ggplot(gp_out2a$insampresults, aes(x = timestep, y = predmean)) +
geom_line() +
geom_ribbon(aes(ymin= predmean - predfsd, ymax = predmean + predfsd)) +
geom_point(aes(y = obs), shape = 21 ) +
theme_bw()
Now, if you want to check the in-sample predictions, just do the same as before:
Code
head(gp_out2a$insampresults, 10)
timestep pop predmean predfsd predsd obs
1 0 1 NA NA NA 0.7686773
2 1 1 NA NA NA 0.8384924
3 2 1 NA NA NA 0.7981533
4 3 1 NA NA NA 0.9195766
5 4 1 NA NA NA 0.8497416
6 5 1 NA NA NA 0.7813106
7 6 1 0.8386267 0.010150692 0.06686471 0.8321681
8 7 1 0.8032712 0.013345600 0.06742372 0.8266759
9 8 1 0.8403718 0.009173628 0.06672338 0.7967478
10 9 1 0.8270748 0.014528528 0.06766780 0.7266621
Likewise, if you want to check the out-of-sample predictions, just do the same as before. Note that in this case you get predictions for the first six test observations!
Code
head(gp_out2a$outsampresults, 10)
timestep pop predmean predfsd predsd obs
1 200 1 0.9660618 0.015604646 0.06790698 0.9900145
2 201 1 0.9490278 0.008292018 0.06660789 1.0517643
3 202 1 0.9794000 0.018911390 0.06874223 1.0439760
4 203 1 0.9630971 0.019025391 0.06877368 0.9448869
5 204 1 0.9828270 0.019624579 0.06894184 0.8433004
6 205 1 0.9599322 0.014307521 0.06762070 1.0777695
7 206 1 0.8751934 0.020737125 0.06926674 0.9805625
8 207 1 0.9854028 0.035131635 0.07484708 0.9673377
9 208 1 0.9713476 0.024072342 0.07033727 0.9310551
10 209 1 0.9685994 0.016690676 0.06816474 0.9466032
Why, in this case, do I get predictions for the first six out-of-sample observations? Does the Custom Lags Method use the last six observations from the training data set, through the lags I defined with the makelags function? Is my interpretation correct?
Multivariate models
Multiple predictor variables (and their lags) can be used in fitGP by listing them under x. If you are using the same number of lags and the same tau for each variable, you can use the E/tau Method. If you want to mix and match predictors, use the Custom Lags Method and pick any combination. Since fitGP standardizes each input internally, and there are separate length scales for each input, data standardization is not as important as it is for other methods like Simplex and S-map.
Code
[1] "Time" "X" "Y" "Z" "X_1" "X_2" "Y_1" "Y_2" "Z_1" "Z_2"
Code
Number of predictors: 3
Length scale parameters:
predictor posteriormode
phi1 X_2 0.12043
phi2 Y_2 0.01721
phi3 Z_2 0.01313
Process variance (ve): 0.1464649
Pointwise prior variance (sigma2): 2.49394
Number of populations: 1
In-sample R-squared: 0.8635693
Out-of-sample R-squared: 0.8340371
Why can’t I fit the model by stating that X and Y are both response variables? e.g., y = c("X","Y")
Error in is.data.frame(x) : 'list' object cannot be coerced to type 'double'
Just to see the predictions:
Code
ggplot(gp_out_mv2$insampresults, aes(x = timestep, y = predmean)) +
geom_line() +
geom_ribbon(aes(ymin = predmean - predfsd, ymax = predmean + predfsd), alpha=0.4) +
geom_point(aes(y = obs)) +
theme_bw()+
facet_wrap(pop ~ ., scales = "free")
Hierarchical models (time series from multiple populations)
As an example, here are some simulated data from 2 populations (PopA and PopB) with theta logistic dynamics. These dynamics derive from an extension of the standard logistic growth model: a parameter (theta) influences the shape of the growth curve and can account for different types of density dependence. The data contain some small lognormal process noise, and the two populations have different theta values, indicating distinct growth characteristics.
Code
str(thetalog2pop)
'data.frame': 100 obs. of 3 variables:
$ Time : int 1 2 3 4 5 6 7 8 9 10 ...
$ Population: Factor w/ 2 levels "PopA","PopB": 1 1 1 1 1 1 1 1 1 1 ...
$ Abundance : num 1.5621 0.0169 0.0583 0.1878 0.6457 ...
Code
When dealing with multiple populations, you have to think a little more carefully about whether you want to scale data across populations (scaling = "global") or within populations (scaling = "local"). Since the data are on somewhat different scales and don’t necessarily represent the same ‘units’, we will use local scaling, as opposed to global. For a more detailed discussion of this, see the next lesson.
The model setup for a hierarchical model is the same as other models, you just need to add a pop column to indicate there are multiple populations. I am not including an exclusion radius for leave-one-out in this case, because the data are not autocorrelated.
Code
Number of predictors: 3
Length scale parameters:
predictor posteriormode
phi1 Abundance_1 0.5317
phi2 Abundance_2 0.0000
phi3 Abundance_3 0.0000
Process variance (ve): 0.01222333
Pointwise prior variance (sigma2): 2.527349
Number of populations: 2
Dynamic correlation (rho): 0.327254
In-sample R-squared: 0.9933975
In-sample R-squared by population:
R2
PopA 0.9970811
PopB 0.9815187
Out-of-sample R-squared: 0.991238
Out-of-sample R-squared by population:
R2
PopA 0.9961280
PopB 0.9754702
From the summary, we can see that the automatic relevance determination (ARD) approach has (perhaps unsurprisingly) deemed lags 2 and 3 unimportant (their \(\phi_i\) values are 0), so E = 1 is probably sufficient. The fitted dynamic correlation (rho = 0.33) tells us that the dynamics of the two populations are rather dissimilar. The within-population and across-population \(R^2\) values are also displayed (these are normalized by either the within- or across-population variance).
Remember: The values of \(\phi_i\) tell you about the wiggliness of the function with respect to each input: a value of 0 indicates no relationship (the function is flat), low values of \(\phi_i\) indicate a mostly linear relationship (the function is very stiff), and large values of \(\phi_i\) indicate more nonlinear relationships (the function is more wiggly)
What are low and large in this case?
For more information on fitting hierarchical models, including the use of pairwise dynamic correlations with >2 populations, see this vignette.
Would it be possible to use a continuous variable as the grouping level, such as lake volume? That is, instead of the categorical pop column, use a continuous volume column. I feel this is what GP regression is made for… but I am not quite sure how to fit it within the GPEDM package.
Visualizing conditional effects (the kernel function)
One thing you might be wondering is how to visualize the shape of the fitted function \(f\). If \(f\) is 1- or 2-dimensional, we can create a grid of input values, pass them as newdata, and plot the resulting line or surface. If the function has more than 2 dimensions, this gets tricky, but we can look at slices of \(f\) with respect to certain predictors, holding the values of the other predictors constant. The function getconditionals is a wrapper that computes and plots the conditional response to each input variable with all other inputs set to their mean value. Although it does not show interactions, it is useful for visualizing the effects of the predictors, and for checking that your function isn’t doing something nonsensical and inconsistent with biology. To create 2-d conditional surfaces, or to look at conditional responses with other predictors fixed to values other than their mean, pass an appropriately constructed grid to newdata. Visualizing models this way makes it possible to think about EDM from a more ‘mechanistic’ functional viewpoint.
Code
#From the multivariate Hastings-Powell simulation
getconditionals(gp_out_mv2)
Code
#From the 2 population theta logistic
getconditionals(tlogtest)
Species interaction strengths
From (Yukihira et al. 2025):
When inferring the Jacobian elements from time series, the GP-EDM approach has three advantages over S-map methods:
- GP-EDM can flexibly control the nonlinearity of each state variable by the automatic relevance determination (ARD) approach. This is in contrast to S-map algorithms, which assume uniform nonlinearity over all state variables. This improves the identifiability of the effects of each species with the GP-EDM inference.
- GP-EDM explicitly incorporates the uncertainty of Jacobian elements into the model structure as well as process noise. In S-map, process noise degrades the estimation of the nonlinearity parameter of the kernel function (i.e. \(\theta\)) by confusing it with the Jacobian uncertainty. Thus, GP-EDM may outperform S-map and regularized S-map in the presence of process noise.
- Unlike S-map methods, GP-EDM can reconstruct the local shape of the nonlinear map around each data point. This means that GP-EDM recovers Jacobian elements even when the focal dynamics are highly nonlinear.
T. Rogers: Just as with S-map, it is possible to obtain local slope coefficients from a GP model. The partial derivatives of the fitted GP function at each time point with respect to each input can be obtained from the predict function, setting returnGPgrad = T. They will be under model$GPgrad. The gradient will be computed for each out-of-sample prediction point requested (using methods "loo", "lto", or newdata). If you want the in-sample gradient, pass the original (training) data back in as newdata.
Code
'data.frame': 200 obs. of 3 variables:
$ d_X_2: num NA NA 0.878 0.862 0.95 ...
$ d_Y_2: num NA NA -0.569 -0.413 -0.538 ...
$ d_Z_2: num NA NA 0.0363 0.0213 0.0352 ...
Code
May you comment on this function:
getGPgrad = function(phi, Xnew, X, Cdi, iKVs, Y) {
  grad = t(-2 * t(t(Xnew - t(X)) %*% diag(phi)) %*% (t(Cdi) * iKVs %*% Y))
  grad
}
What does this function do? I am interested in understanding it because, using this function, we can extract the posterior of the partial GP derivatives (the S-map Jacobian elements, a.k.a. species interaction strengths). Can we get the full posterior, with mean and 95% CI?
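For intuition, here is the standard GP result (not necessarily the exact GPEDM internals): the derivative of a GP is itself a GP, so the posterior of each partial derivative is Gaussian, and a full posterior (mean and 95% CI) is in principle available. For a squared-exponential kernel with inverse length scales \(\phi_d\), the posterior mean of the partial derivative at a point \(x_*\) is

\[
\mathbb{E}\!\left[\frac{\partial f}{\partial x_{d}}(x_*)\right]
= \sum_i \frac{\partial k(x_*, x_i)}{\partial x_{*d}}\left[(K + \sigma^2 I)^{-1}\mathbf{y}\right]_i,
\qquad
\frac{\partial k(x_*, x_i)}{\partial x_{*d}} = -2\,\phi_d\,(x_{*d} - x_{id})\,k(x_*, x_i)
\]

The \(-2\,\phi_d\,(x_{*d} - x_{id})\,k(\cdot)\) factor is recognisable in the code, assuming `Cdi` holds the cross-covariances \(k(x_*, x_i)\) and `iKVs %*% Y` corresponds to \((K + \sigma^2 I)^{-1}\mathbf{y}\). The posterior variance follows from the mixed second derivative \(\partial^2 k / \partial x_{*d}\, \partial x'_{*d}\) minus the usual quadratic form, so a 95% CI can be computed as mean \(\pm 1.96\) posterior SD.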
Some expectations / hypotheses
Let's generate some simulated data and plot them.
Question:
To what extent does invasive species presence/abundance influence plankton emergent properties?
Code
get_GP_simulation <-
function(x = seq(0,10, by = 0.1),
Intercept = 0,
slope = 1,
sigma = 1,
sigma_f = 0.5,
lambda = 100,
seed = NULL) {
if (! is.null(seed)){
set.seed(seed)
}
# number of points to generate prediction for
N <- length(x)
# linear predictor (vanilla LM)
eta = Intercept + slope * x
# kernel function: an exponential kernel on the index distance |i - j|,
# with lambda acting as a length scale in units of grid steps
get_covmatrix <- function(x, sigma_f, lambda) {
N <- length(x)
K <- matrix(0, nrow = N, ncol = N)
for (i in 1:N) {
for (j in 1:N) {
K[i, j] <- sigma_f^2 * exp(-abs(i - j) / (2 * lambda))
}
}
K
}
# covariance matrix
K <- get_covmatrix(x, sigma_f, lambda)
# Gaussian process wiggles
epsilon_gp <- mvtnorm::rmvnorm(
n = 1,
mean = rep(0, N),
sigma = K)[1,]
# central tendency
mu <- epsilon_gp + eta
# data prediction
y <- rnorm(N, mu, sigma)
tibble(x, eta, mu, y)
}
Then plot:
Code
# simulate
sim <- get_GP_simulation(
x = seq(-1 , 1, length.out = 500),
Intercept = 0,
slope = 0.3,
sigma_f = 0.5,
lambda = 20,
sigma = 0.05)
# then plot
ggplot(sim, aes(x = x, y = eta )) +
#geom_line(color = project_colors[1], size = 1.25) +
geom_line(aes(y = mu), color = "black", linewidth = 0.85) +
geom_point(aes(y = y), color = "darkgrey", shape = 21, size = 1.5)+
theme_bw(base_size = 14)+
scale_y_continuous(
name = "Connectance (%)", # axis title
labels = function(breaks) {
# Map the default breaks linearly to a new range, e.g., 10 to 70
scales::rescale(breaks, to = c(10, 70))
})
S-Map: multiview distance regularised algorithm
Preliminary information:
What is an S-map? It is a locally weighted linear fit with a state-space-dependent kernel function.
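Schematically (this is the standard S-map formulation), at each target point \(x_*\) one solves a locally weighted least-squares problem, with weights that decay with distance in state space:

\[
\hat{C}(x_*) = \operatorname*{arg\,min}_{C} \sum_{t} w_t \left( y_t - C\,x_t \right)^2,
\qquad
w_t = \exp\!\left( -\theta \, \frac{\lVert x_t - x_* \rVert}{\bar{d}} \right)
\]

where \(\bar{d}\) is the mean distance to \(x_*\) and \(\theta \ge 0\) tunes the state dependence (\(\theta = 0\) recovers a single global linear model).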
The regularized S-map algorithm was introduced by Cenci et al. (2019, Methods in Ecology and Evolution) to reduce the S-map's sensitivity to process noise.
You should know that when you use the regularized S-map algorithm, the regularization scheme and the structure of the kernel function MATTER a lot (Cenci et al. 2019). You should use different regularization schemes and/or kernel functions depending on whether inference or forecasting is the ultimate goal of your study.
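Schematically, and in the spirit of Cenci et al. (2019), regularisation adds an elastic-net-style penalty to the locally weighted least-squares objective of the S-map (\(\lambda\) and \(\alpha\) are the penalty factors):

\[
\hat{C}(x_*) = \operatorname*{arg\,min}_{C}\;
\sum_{t} w_t \left( y_t - C\,x_t \right)^2
\;+\; \lambda \left[ \alpha \lVert C \rVert_1 + \frac{1-\alpha}{2} \lVert C \rVert_2^2 \right]
\]

with \(\alpha = 1\) giving a LASSO-type penalty and \(\alpha = 0\) a ridge-type penalty.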
But the classical regularized S-map algorithm allows only a limited number of variables (or network nodes) to be embedded in the model (Deyle et al. 2016; Ushio et al. 2018).
This restriction arises from the "curse of dimensionality" (Bellman 1957; Munch et al. 2023).
Curse of dimensionality: if we increase the number of predictors, we need far more data to reconstruct the "true" attractor.
More precisely, the number of possible attractor reconstructions grows combinatorially with the number of variables we use (Ye & Sugihara 2016a). For example, the number of distinct three-dimensional embeddings for a system with up to 3 lags of 10 variables is nearly 3000. Imagine: three thousand attractors with equivalent, or near-equivalent, fits!
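That count can be reproduced in a line of R, assuming (as in the multiview embedding literature) that a valid embedding must contain at least one unlagged (lag-0) coordinate: 10 variables with 3 lags each give 30 candidate coordinates, of which 20 are lagged.

```r
# 3-d embeddings from 30 candidate coordinates, minus those built
# only from the 20 lagged (non lag-0) coordinates
choose(30, 3) - choose(20, 3)
#> [1] 2920
```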
Therefore, Chang et al. (2021) proposed a new algorithm, the multiview distance regularised S-map (MDR S-map), which quantifies the neighbouring relationships among high-dimensional data points in the attractor reconstruction.
The MDR S-map algorithm consists of two steps:
1) measuring multiview distances among m-dimensional data points using various E-dimensional multiview attractors, each embedding a different combination of variables (Ye & Sugihara 2016b); and
2) determining the m-dimensional interaction Jacobian matrix at each time point by plugging the multiview distances into the S-map algorithm with regularization constraints (Cenci et al. 2019).
Here m = number of nodes (Figure 1).
Figure 1: The MDR S-map consists of two steps. Please note that SSR is a synonym of attractor. The aim of the first step is to obtain the multiview distance that operates at the optimal embedding dimension (\(E\)), which is much smaller than the dimensionality of large interaction networks (\(m\), the number of nodes). Under the curse of dimensionality, neighbourhood relationships among data points cannot be precisely inferred from the distance \(d_M(\cdot)\) among high-dimensional data points in manifold \(M_x\) that summarises the dynamics of the entire interaction network. Therefore, (1–1 in Figure 1) we recovered the neighbourhood relationships among high-dimensional data points \(X(t)\) from numerous low-dimensional state space reconstructions (SSR; \(M_1, M_2, \ldots, M_C\)), that is, multiview SSR. Among these SSRs, (1–2) we computed distances (L2 norm, i.e., Euclidean distance) between data points under the optimal embedding dimension \(E\). Collecting all these distances, (1–3) we obtained the multiview distances (\(d_E\)) and (1–4) determined the data weights (\(W_E\)) for the following S-map analysis. In the second step, we estimated the high-dimensional interaction strengths (\(B\)) by S-map, based on locally weighted least-squares optimisation \(\arg\min_B (|\cdot|_2)^2\) with regularisation \(\lambda[\cdot]\) incorporated. Specifically, (2–1) the weights \(w_E = \exp(-\theta d_E / \mathrm{mean}(d_E))\), derived in the first step, were plugged into the S-map optimisation algorithm. (2–2) Under the constraint of regularisation (\(\lambda\) and \(\alpha\) are the penalty factors of regularisation), (2–3) we solved the high-dimensional local linear coefficients to approximate interaction strengths at each time step.
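Step (2–1) of Figure 1 is simple to sketch in R (toy distances, not real data):

```r
# S-map weights from multiview distances: w_E = exp(-theta * d_E / mean(d_E))
smap_weights <- function(d_E, theta) exp(-theta * d_E / mean(d_E))

smap_weights(c(0.2, 0.5, 1.3), theta = 1)  # nearer points get larger weights
```

Increasing `theta` concentrates the weights on the nearest neighbours in state space; `theta = 0` weights all points equally.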
Zurich lake example
I am going to use the dataset from (Merz et al. 2023). These are monthly time series from Lake Zurich (Switzerland) of 13 plankton functional groups, phosphate concentration, and water temperature, from January 1978 to December 2019.
This dataset was used by (Medeiros et al. 2025), and I include here figure S19 from their supplementary materials: