User Guide
There are 3 main actions needed to train and use the different models: initialization, training, and prediction.
Initialization
Possible models
There are currently seven possible Gaussian Process models:
`GP`
corresponds to the original GP regression model; it necessarily uses a Gaussian likelihood.
`GP(X_train, y_train, kernel)`
`VGP`
is a variational GP model: a multivariate Gaussian approximates the true posterior. There is no inducing-point augmentation involved, so it is well suited for small datasets (~10^3 samples).
`VGP(X_train, y_train, kernel, likelihood, inference)`
`SVGP`
is a variational GP model augmented with inducing points. The optimization is done on those points, allowing stochastic updates and scalability to large datasets. The trade-off can be slightly lower accuracy and the need to select the number and location of the inducing points (this is however an actively worked-on problem).
`SVGP(X_train, y_train, kernel, likelihood, inference, n_inducingpoints)`
`OnlineSVGP`
is an online variational GP model. It is based on the streaming method of Bui et al. (2017) and supports all likelihoods, even with multiple latents.
`OnlineSVGP(kernel, likelihood, inference, n_latent, inducing_points)`
`MOVGP`
is a multi-output variational GP model.
`MOSVGP`
is a multi-output sparse variational GP model, based on Moreno-Muñoz et al. (2018).
`VStP`
is a variational Student-T model where the prior is a multivariate Student-T distribution with scale `K`, mean `μ₀` and degrees of freedom `ν`. The inference is done automatically by augmenting the prior as a scale mixture of inverse gammas.
`VStP(X_train, y_train, kernel, likelihood, inference, ν)`
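As an illustration, here is a minimal sketch of initializing one of these models for binary classification. The kernel is assumed to come from KernelFunctions.jl, and the data are placeholders:

```julia
using AugmentedGaussianProcesses
using KernelFunctions  # assumed source of the kernel constructor

# Placeholder data: 200 samples with 2 features, binary labels
X_train = rand(200, 2)
y_train = rand([-1, 1], 200)

# Sparse variational GP with 20 inducing points, following the constructor shown above
model = SVGP(X_train, y_train,
             SqExponentialKernel(),  # kernel
             LogisticLikelihood(),   # likelihood
             AnalyticVI(),           # inference
             20)                     # number of inducing points
```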
Likelihood
`GP` can only have a Gaussian likelihood, while `VGP` and `SVGP` have more choices. Here are the ones currently implemented:
Regression
For regression, four likelihoods are available:
- The classical `GaussianLikelihood`, for Gaussian noise.
- The `StudentTLikelihood`, assuming noise from a Student-T distribution (more robust to outliers).
- The `LaplaceLikelihood`, with noise from a Laplace distribution.
- The `HeteroscedasticLikelihood` (in development), where the noise is a function of the input: $\mathrm{Var}(X) = \lambda\sigma^{-1}(g(X))$, where $g(X)$ is an additional Gaussian Process and $\sigma$ is the logistic function.
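For instance, a robust regression model could be sketched as follows (reusing the setup above; the degrees-of-freedom argument of `StudentTLikelihood` is an assumption):

```julia
# Placeholder 1-D regression data
X_train = rand(100, 1)
y_train = sin.(vec(X_train)) .+ 0.1 .* randn(100)

# Full variational GP with Student-T noise (more robust to outliers)
model = VGP(X_train, y_train,
            SqExponentialKernel(),
            StudentTLikelihood(3.0),  # ν = 3 degrees of freedom (argument assumed)
            AnalyticVI())
```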
Classification
For classification one can select among:
- The `LogisticLikelihood`: a Bernoulli likelihood with a logistic link.
- The `BayesianSVM` likelihood, based on the frequentist SVM and equivalent to using a hinge loss.
Event Likelihoods
For likelihoods such as Poisson or Negative Binomial, we approximate a parameter by σ(f). Two likelihoods are implemented:
- The `PoissonLikelihood`: a discrete Poisson process (one parameter per point) with the scale parameter defined as λσ(f).
- The `NegBinomialLikelihood`: the Negative Binomial likelihood, where r is fixed and the success probability p is defined as σ(f).
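A count-data model might be set up as follows (a sketch; the default `PoissonLikelihood` constructor arguments are an assumption):

```julia
# Placeholder count data: non-negative integer observations
X_train = rand(150, 1)
y_train = rand(0:10, 150)

model = VGP(X_train, y_train,
            SqExponentialKernel(),
            PoissonLikelihood(),  # rate defined as λσ(f); default λ assumed
            AnalyticVI())
```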
Multi-class classification
There are two available likelihoods for multi-class classification:
- The `SoftMaxLikelihood`, the most common approach. However, no analytical solving is possible.
- The `LogisticSoftMaxLikelihood`, a modified softmax where the exponential function is replaced by the logistic function. This allows for a fully conjugate model (see the corresponding paper).
More options
There is a project to make distributions from Distributions.jl work directly as likelihoods.
Inference
Inference can be done in various ways.
- `AnalyticVI`: Variational Inference with closed-form updates. For non-Gaussian likelihoods, this relies on augmented versions of the likelihoods. For Stochastic Variational Inference, one can use `AnalyticSVI` with the mini-batch size as an argument.
- `GibbsSampling`: Gibbs sampling of the true posterior. This also relies on an augmented version of the likelihoods and is only valid for the `VGP` model at the moment.

The next two methods rely on a numerical approximation of an integral, and I therefore recommend using the classical `Descent` approach, as it will use the natural gradient updates anyway; `ADAM` seems to give erratic results.

- `QuadratureVI`: Variational Inference with gradients computed by estimating the expected log-likelihood via quadrature.
- `MCIntegrationVI`: Variational Inference with gradients computed by estimating the expected log-likelihood via Monte Carlo integration.

We also use AdvancedHMC.jl to provide an HMC algorithm, although Gibbs sampling is generally preferable when available.
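For example, stochastic variational inference on mini-batches could be selected like this (a sketch; the batch size and model settings are placeholders):

```julia
# AnalyticSVI takes the mini-batch size as its argument
model = SVGP(X_train, y_train,
             SqExponentialKernel(),
             LogisticLikelihood(),
             AnalyticSVI(64),  # closed-form updates on mini-batches of 64 points
             50)               # 50 inducing points
```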
Compatibility table
Not all inference methods are implemented/valid for all likelihoods; here is the compatibility table between them.
Likelihood/Inference | AnalyticVI | GibbsSampling | QuadratureVI | MCIntegrationVI |
---|---|---|---|---|
GaussianLikelihood | ✔ (Analytic) | ✖ | ✖ | ✖ |
StudentTLikelihood | ✔ | ✔ | ✔ | ✖ |
LaplaceLikelihood | ✔ | ✔ | ✔ | ✖ |
HeteroscedasticLikelihood | ✔ | (dev) | (dev) | ✖ |
LogisticLikelihood | ✔ | ✔ | ✔ | ✖ |
BayesianSVM | ✔ | (dev) | ✖ | ✖ |
LogisticSoftMaxLikelihood | ✔ | ✔ | ✖ | (dev) |
SoftMaxLikelihood | ✖ | ✖ | ✖ | ✔ |
PoissonLikelihood | ✔ | ✔ | ✖ | ✖ |
NegBinomialLikelihood | ✔ | ✔ | ✖ | ✖ |
(dev) means that the feature is possible and may be developed and tested, but it is not available yet. All contributions or requests are very welcome!
Additional Parameters
Hyperparameter optimization
One can optimize the kernel hyperparameters as well as the inducing point locations by maximizing the ELBO. All derivations are already hand-coded (no AD needed). One can select the optimization scheme via:
- The `optimiser` keyword: can be `nothing` or `false` for no optimization, or an optimiser from the Flux.jl library (see the list of Optimisers).
- The `Zoptimiser` keyword: similar to `optimiser`, it is used for optimizing the inducing point locations; it is set to `nothing` by default (no optimization).
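For instance (a sketch; `ADAM` and `Descent` are Flux.jl optimisers, the step sizes are placeholders, and passing the keywords to the model constructor is assumed):

```julia
using Flux.Optimise: ADAM, Descent

model = SVGP(X_train, y_train,
             SqExponentialKernel(),
             LogisticLikelihood(),
             AnalyticVI(),
             20;
             optimiser = ADAM(0.01),       # optimize kernel hyperparameters
             Zoptimiser = Descent(0.001))  # also optimize inducing point locations
```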
PriorMean
The `mean` keyword allows you to add different types of prior means:
- `ZeroMean`, a constant mean that cannot be optimized.
- `ConstantMean`, a constant mean that can be optimized.
- `EmpiricalMean`, a vector mean with a different value for each point.
- `AffineMean`, where μ₀ is given by `X * w + b`.
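For example (a sketch; the `ConstantMean` constructor argument is an assumption):

```julia
# GP with an optimizable constant prior mean passed via the `mean` keyword
model = GP(X_train, y_train, SqExponentialKernel();
           mean = ConstantMean(1.0))  # constant value 1.0 (argument assumed)
```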
Training
Training is straightforward after initializing the model, by running:
`train!(model, 100; callback = callbackfunction)`
where the `callback` option is for running a function at every iteration. `callbackfunction` should be defined as:
```julia
function callbackfunction(model, iter)
    # do things here
end
```
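For example, a simple callback that logs progress every 10 iterations (a minimal sketch):

```julia
function callbackfunction(model, iter)
    # Print the iteration number every 10 iterations
    if iter % 10 == 0
        @info "Training progress" iteration = iter
    end
end

train!(model, 100; callback = callbackfunction)
```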
Prediction
Once the model has been trained, it is finally possible to compute predictions. There are always three possibilities:
- `predict_f(model, X_test, covf=true, fullcov=false)`: computes the parameters (mean and covariance) of the latent normal distribution at each test point. If `covf=false`, only the mean is returned; if `fullcov=true`, the full covariance matrix is returned instead of only the diagonal.
- `predict_y(model, X_test)`: computes the point estimate of the predictive likelihood for regression, or the label of the most likely class for classification.
- `proba_y(model, X_test)`: returns the mean and the variance of each point for regression, or the predictive likelihood of obtaining the class `y=1` for classification.
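For example, on some placeholder test inputs:

```julia
X_test = rand(20, 2)

μ, Σ = predict_f(model, X_test, covf = true)  # latent mean and (diagonal) covariance
ŷ    = predict_y(model, X_test)               # point estimates or most likely labels
p    = proba_y(model, X_test)                 # predictive mean/variance or class probabilities
```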
Miscellaneous
🚧 In construction – Should be developed in the near future 🚧
Saving/Loading models
Once a model has been trained, it is possible to save its state in a file using `save_trained_model(filename, model)`; a partial version of the model will be saved in `filename`.
It is then possible to reload this file using `load_trained_model(filename)`. However, note that it will not be possible to train the model further! This function is only meant for doing further predictions.
🚧 Pre-made callback functions 🚧
There is one (for now) premade function to return an MVHistory object and a callback function for the training of binary classification problems. The callback will store the ELBO and the variational parameters at every iteration included in `iterpoints`. If `X_test` and `y_test` are provided, it will also store the test accuracy and the mean and median test log-likelihood.