Multivariate Linear Regression: Data models and regression methods

MLR Data model

In MLR-based MVPA, the standard linear model has the following form [1]:
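Writing E for the error term, a formulation consistent with the matrix dimensions given below is

Y = X M + E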

where, apart from the error term, matrix Y contains the predictors (n volumes x p predictors), matrix X contains the fMRI data (n volumes x v voxels) and matrix M the model maps (v voxels x p maps). When p>1, this is equivalent to estimating p different MLR models, one for each predictor [2]:
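That is, denoting by y_j, m_j and e_j the j-th columns of Y, M and E, respectively, each predictor is modeled separately as

y_j = X m_j + e_j ,    j = 1, ..., p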

Because the number of voxels v is usually much larger than the temporal dimension n, the data X are projected onto a space of lower dimensionality, where the actual learning of the parameters takes place. To this end, all volumes xi (rows of X, i=1, ..., n) are first transformed using a mapping function F(xi) and then used to construct the so-called kernel matrix K, with elements Kij = <F(xi), F(xj)>, where <a,b> denotes the inner (scalar) product of vectors a and b. The simplest kernel is the linear kernel, for which F(x)=x and therefore K = X X^T.
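As a minimal numpy sketch of this construction (the dimensions are arbitrary and purely illustrative):

import numpy as np

# Illustrative dimensions: n volumes, v voxels, with v >> n.
n, v = 120, 40000
rng = np.random.default_rng(0)
X = rng.standard_normal((n, v))   # fMRI data matrix, one volume per row

# Linear kernel: F(x) = x, so Kij = <xi, xj> and K = X X^T is only n x n.
K = X @ X.T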

In the kernel space, the MLR model for a single predictor y becomes:
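With w an n x 1 vector of kernel-space weights and e the error term, the model presumably reads

y = K w + e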

 

Once the vector w has been estimated by one of the regression methods discussed below, the prediction y' for a new data set X' can be performed by projecting the new data onto the old data, e.g., for the linear kernel, using the new kernel matrix K' = X' X^T:
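A form consistent with this description is

y' = K' w = X' X^T w = X' m ,    with    m = X^T w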

where the v x 1 vector m is the predictive map of the model (a column of M), expressing the relevance of each voxel for the model.


MLR estimation methods

The estimation of the vector w in the MLR model can be performed with the following methods: least squares (LS), ridge regression (RR) and the relevance vector machine (RVM).

The LS solution is commonly used in parameter estimation because it is unbiased and corresponds to the maximum likelihood solution if the noise term can be assumed i.i.d. Gaussian. Most importantly, it is computationally tractable. In the kernel space, the LS solution takes the following form:
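For an invertible kernel matrix, this presumably reduces to

w_{LS} = K^{-1} y

(with K^{-1} replaced by the pseudo-inverse of K when K is singular).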

The main problem of the LS solution is that, like maximum likelihood solutions in general, it tends to overfit the data: in practice, as the number of parameters to estimate grows, the LS solution will have zero error on the training data but ultimately chance-level performance on new data, and therefore the poorest generalization performance.

The RR solution was introduced in the 1970s to deal with collinearity; it is useful for high-dimensional data and computationally fast. The form of the RR solution is very similar to the LS solution, the main difference being the regularization term lambda:
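This is the standard kernel ridge expression

w_{RR} = (K + \lambda I)^{-1} y

with I the n x n identity matrix and \lambda > 0 the regularization parameter.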

The lambda parameter can be either set arbitrarily or determined via the generalized cross-validation (GCV) procedure, which searches a given interval of values for the one that best predicts an internal subset of the measurements from another subset. To this end, it is often useful in the search for the RR-GCV solution to plot the GCV mean squared error of this prediction against the lambda values, to graphically assess whether the local minimum can also be considered a global minimum, ensuring the best generalization performance of the model.
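As an illustrative sketch of such a search (assuming numpy, a linear kernel, and the standard closed-form GCV criterion based on the hat matrix H = K (K + \lambda I)^{-1}; the helper name is hypothetical, and the internal-subset scheme described above may be implemented differently), a grid search over lambda could look like this:

import numpy as np

def kernel_ridge_gcv(X, y, lambdas):
    # Grid-search lambda for kernel ridge regression using the closed-form
    # GCV criterion: GCV(lambda) = (1/n)||y - H y||^2 / (1 - trace(H)/n)^2,
    # with hat matrix H = K (K + lambda I)^{-1}. Illustrative sketch only.
    n = X.shape[0]
    K = X @ X.T                                       # linear kernel (n x n)
    gcv = []
    for lam in lambdas:
        H = K @ np.linalg.inv(K + lam * np.eye(n))    # maps y to fitted values
        resid = y - H @ y
        gcv.append((resid @ resid / n) / (1.0 - np.trace(H) / n) ** 2)
    gcv = np.asarray(gcv)
    lam = lambdas[int(np.argmin(gcv))]                # grid point with lowest GCV error
    w = np.linalg.solve(K + lam * np.eye(n), y)       # kernel-space weights
    m = X.T @ w                                       # v x 1 predictive map (column of M)
    return lam, w, m, gcv

# Example usage on synthetic data (arbitrary sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
y = X[:, :10].sum(axis=1) + rng.standard_normal(100)
lam, w, m, gcv = kernel_ridge_gcv(X, y, np.logspace(-3, 6, 30))

Plotting gcv against the lambda grid on a logarithmic axis reproduces the diagnostic curve described above.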

The Relevance Vector Machine (RVM) is a probabilistic method and therefore strictly requires a probabilistic, and not algebraic, formulation of the data model. Assuming that the noise term follows an i.i.d. Gaussian distribution with zero mean, the conditional probability of the prediction vector, given the above kernel-transformed MLR model, can be expressed as [3]:
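Writing \beta for the noise precision (the inverse of the noise variance), this conditional probability presumably takes the standard Gaussian form

p(y | w, \beta) = \mathcal{N}(y | K w, \beta^{-1} I) = (\beta / 2\pi)^{n/2} \exp( -(\beta/2) ||y - K w||^2 )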

With this formulation, the LS (or Maximum Likelihood) solution in the kernel space would be expressed as:
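A form consistent with this statement is

w_{ML} = \arg\max_w p(y | w, \beta) = \arg\min_w ||y - K w||^2

i.e., the same algebraic LS solution given above.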

In addition, the posterior probability of the weights will become: 
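By Bayes' rule, this presumably reads

p(w | y, \beta) \propto p(y | w, \beta) \, p(w)

where p(w) is the prior distribution over the weights.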

Using this formulation, assuming a prior in which the weights are also independent and normally distributed with zero mean, and then maximizing the posterior, leads to the RR solution with a regularization parameter equal to the ratio between the noise variance and the prior variance of the weights; the RR solution therefore corresponds to the so-called maximum a posteriori probability (MAP) solution.

As an alternative to assuming that the weights are identically normally distributed (i.e., with the same variance), the RVM framework introduces "sparsifying" priors on the weights, allowing a different precision alpha (inverse of the variance) for each of them [4]. Namely:
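Following [4], this prior takes the form

p(w | \alpha) = \prod_{i=1}^{n} \mathcal{N}(w_i | 0, \alpha_i^{-1})

with one precision hyperparameter \alpha_i for each of the n kernel-space weights.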

While the components of w are called weights and represent the true parameters to estimate, the alpha parameters are called hyperparameters. A common choice is to use a log-uniform hyperprior over the alphas, which, in combination with the Gaussian prior, implements the so-called Automatic Relevance Determination (ARD) framework [5,6]. In fact, during the estimation of the model, many hyperparameters grow to infinity, so that the corresponding model weights wi have a posterior distribution concentrated around zero. In other words, only the model weights that are “relevant” given the training data remain, pruning out the unnecessary ones and leading to a sparse model. The values of the alpha and beta parameters are determined using type-II maximum likelihood (also known as the evidence approximation) [1].
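The following is a minimal numpy sketch of this pruning mechanism (the function name is hypothetical and the update rules are the standard type-II ML re-estimation formulas from [4]; the actual implementation may differ):

import numpy as np

def rvm_regression(K, y, n_iter=200, alpha_max=1e9):
    # Sparse Bayesian (RVM) regression with ARD priors -- illustrative sketch.
    # K: (n, n) kernel matrix, y: (n,) predictor values.
    n = K.shape[0]
    alpha = np.ones(n)            # one precision hyperparameter per kernel weight
    beta = 1.0 / np.var(y)        # noise precision (initial guess)
    for _ in range(n_iter):
        keep = alpha < alpha_max                       # prune weights whose alpha diverged
        Kk = K[:, keep]
        Sigma = np.linalg.inv(np.diag(alpha[keep]) + beta * Kk.T @ Kk)   # posterior covariance
        mu = beta * Sigma @ Kk.T @ y                   # posterior mean of the weights
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)     # "well-determinedness" of each weight
        alpha[keep] = gamma / (mu ** 2)                # type-II ML update of the alphas
        resid = y - Kk @ mu
        beta = (n - gamma.sum()) / (resid @ resid)     # type-II ML update of the noise precision
    w = np.zeros(n)
    w[keep] = mu                  # only the "relevant" weights survive; the rest stay at zero
    return w, alpha, beta

With the linear kernel K = X X^T, the surviving (relevant) weights again yield the voxel map as m = X^T w.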

Compared to RR-GCV, RVM provides in many applications a much sparser and more compact model, with little or no increase in generalization error [4]. On the other hand, RVM suffers from being overly confident when making predictions in regions far from the training data. Other Bayesian regression techniques, such as Gaussian Processes, do not suffer from this drawback, at the expense of an increased computational time.

References

[1] Bishop, C.M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.

[2] Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.

[3] Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

[4] Tipping, M.E., 2001. Sparse Bayesian Learning and the Relevance Vector Machine. J. Mach. Learn. Res. 1 (3), 211-244.

[5] MacKay, D.J.C., 1994. Bayesian methods for backpropagation networks. In: Domany, E., van Hemmen, J.L., Schulten, K. (Eds.), Models of Neural Networks III, ch. 6. Springer-Verlag.

[6] Neal, R.M., 1996. Bayesian Learning for Neural Networks. Springer.