A didactically motivated PLS prediction algorithm

The intention of this paper is to develop an easily understood PLS prediction algorithm, especially for the control community. The algorithm is based on an explicit latent variables model, and is otherwise a combination of the previously published Martens and Helland algorithms. A didactic connection to Kalman filtering theory is provided for a methodological overview.


Introduction
The partial least squares regression (PLSR) algorithms of Wold and Martens provide alternative, powerful tools for handling ill-conditioned multivariate regression data, see e.g.Martens and Nees (1989) and Höskuldson (1996) for overviews of the fundamental multivariate calibration concept, in which are presented the above two slightly different versions of the PLSR method.Both algorithms assume collinear regressor variables generated by underlying latent variables, and although they are different with respect to the score and loading matrices involved, they give identical predictors.
There is certainly no practical need for yet another algorithm that provide the same predictor as the well-known algorithms of Wold and Martens.However, these algorithms, as presented in the chemometrical literature, make no use of the dynamic systems theory available for readers from the control community, and the main aim of the present paper is therefore to provide a predictor derivation at first especially for this category of readers.
This actually also results in a simplified version of the Martens prediction algorithm making use of the so-called loading weight vectors only, which would appear to be of general interest (see Esbensen (2000) for a discussion of loadings and loading weights).The score vectors, which are also essential elements of chemometrics, are easily computed once a preliminary predictor is found.
The new algorithm has a lot in common with an algorithm developed by Helland (1988), and a similar algorithm by Di Ruscio (2000).Again, the reason behind the present modifications is mainly didactic.The key step in the simplification is the use of an explicit latent variables model, which facilitates an early introduction of the Helland predictor form using only loading weight vectors.The present exposition, we believe, constitute a novel, easy, and complete introduction to the prediction aspect of multivariate calibration.

The multivariate calibration problem
Assuming a standard regression problem with a scalar response variable y and multivariate regressor variables x, the PLS1 setting in chemometrics, the object is to find a predictor b from empirical or experimental data, that may be used to predict a new response yo from new observations xo according to r (1) Yo = xo b 1 The specific multivariate calibration problem arises when the number of x variables is larger than the number of observations in the available data, which calls for regularized solutions.
Remark 1 In typical chemometric applications the regressor variables represent more or less noisy measurements.From a strict systems engineering point of view the notation y1 for the response and y2 for the regressors would then be more natural.However, here we shall use the well-established chemometrical notation.Martens and Naes (1989) give an abundance of didactic multivariate calibration problems as seen from the chemometric point-of-view when dealing with what might be called static multivariate calibration (see Ergon (1998Ergon ( , 1999) ) for discussions of dynamic counterparts).One illustrative archetypal example of modern analytical instrument multivariate calibration is that of prediction of protein content in whole wheat kernels based on near infrared (NIR) spectroscopy (Norris, 1993).Here, the protein content is the response variable y, while the instrumental NIR reflectance at a large number of frequencies serve as the x variables.This example actually represents a modern breakthrough for practical chemometric multivariate calibration, allowing rapid NIR to replace the traditional slow, wet-chemical determinations.The fundamental problem in such cases is that the number of x variables may be much larger than the actual number of observations in the calibration (training) data, which gives rise to well-known statistical degrees-of-freedom problems.Also, the X data matrix typically is made up of significantly collinear variables, each of which is usually also fraught with a non-trivial measurement error.This case clearly will spell disaster for e.g. a multivariate linear regression (MLR) approach.

Least squares estimation
Assuming experimental data from N observations, y= [ Y1 y 2 ... yN] T and X = [x 1 x2 ... xN] T , and independent observation errors, we find the least squares (LS) regression solution (e.g.Johnson and Wichern, 1992 With a large number p of x variables, this solution will be very noise and collinearity sensitive, and in practical applications the LS method will give satisfactory results only when p is well below N. A general and detailed analysis of this problem for limited values of N is beyond the scope of the present paper (see e.g.Belsley (1991), but the issue is also well dealt with in the chemometrical literature).

Latent variables model
When, as indeed in very many practical multivariate calibration situations, the large number of x variables are significantly to highly collinear, the regressor information may be compressed into a much smaller number of latent variables T = [i i 12 ... Tar' (e.g.Burnham et al. (1999), Martens and Naes (1989) and Høskuldson (1996)).The model underlying such data compression is where ek , v 1,k and v2,1 are white noise sequences with covariances Re , r11 and R22, and where we assume centered data.This is a special case of a more general dynamic model with Tk+ 1 = ATk + ek , i.e. we use here A = 0. We will also assume that W is orthonormal, i.e. that W TW = I.
With N observations and T , the latent variables model (3) gives the output equations Remark 2 If the regressor data are actually generated from an underlying state vector zk as xk = C2zk + v 2,1, the output equation ( 5) will be replaced by X = ZCZ + V2 .In the special noise free case with V2 = 0, and with 1\ oo, we may then find T and W T by a similarity transformation based on factorization of X by a number of alternative methods, e.g.PCA and PLSR as described below.In practice we will always observe some noise and have a limited N and factorization of X then gives only estimates T and WT.

Regularized solutions
The LS solution of ( 5) is and from equations ( 4) and ( 6) we thus find the LS predictor related to the latent variables This predictor is also given in Helland (1988), although not directly based on an LV model.
The problem now is to find W or more realistically good estimates W, and in this endeavour we have in fact a number of possibilities.A simple choice is W = I i ,, which brings us back to the LS solution (2).Other choices give the standard statistical PCR and standard chemometrical PLSR solutions as discussed below.The interested reader might note that the theoretically optimal regularization is given by a Kalman gain (see Appendix A for details).

Principal component regression
In PCR the weighting matrix estimate is W = WPCR = P, where P is the loading matrix related to a principal components decomposition of X (Johnson and Wichern, 1989).We may find T = TPCR = U1S 1 and P = V1 from the singular value decomposition The latent variables represented by the score matrix TPCR are thus based on only X information.
In chemometrics there is a strong tradition for using the NIPALS algorithm for this decomposition (e.g.Martens and Naes, 1989).In the expression we then successively maximize the sample variances tits/(N-1), 02 /(N-1) under the constraint that i 2 is orthogonal to 1 1i 0 3 /(N-1) under the constraints that 13 is orthogonal to t l and 12 etc.

Partial least squares regression
Some of the latent variables represented in TpcR = [i i t2 ... ia ] may be weakly correlated with the response variable in y.This is where the PLSR solution suggests to use both X and y information in order to find an improved version of W = [W 1 W 2 ... *a ], the so-called loading weight matrix.In the Wold and Martens algorithms this is done by a step wise computation of wl , w2 , ..., *a (e.g.Martens and Naes, 1989), but a one-step procedure is also available (Di Ruscio, 2000).
The so-called Martens algorithm is based on the factorization X -TpLs W ks + E = t 1 wi + i2 wz + ... + ta *å + Ea and the following modified algorithm has the same starting point.
Remark 3 The Martens algorithm uses a non-orthogonal score matrix, i.e.TP stPLS is non-diagonal, while the alternative Wold algorithm makes use of an orthogonal score matrix (e.g.Martens and Naes, 1989).
In the first step we try to explain y by use of only one component in equation ( 13), i.e. by using where it follows from equation ( 6) that tl = Xw l (15) We then maximize the sample covariance ii y/(N-1) = wi Xy l (N-1) under the constraint that * 1 has the length /* w 1 = 1 (instead of maximizing iT ii /(N-1) as in PCR).We find the maximum when * 1 has the same direction as X Ty, i.e.
wl = c 1 X T Y ( 16) where c 1 = (yTXX T y) -1/2• Remark 4 The scaling of r' i is not absolutely necessary, i.e. we may use c,= 1 (Helland, 1988), but it is very often carried out for practical reasons, and it furthers the most comprehensive interpretation possibilities according to the chemometric tradition.
Remark 5 Maximization of the sample covariance is the favored ad hoc chemometric solution.The theoretically optimal solution assuming known covariances R, and R22 is to let W be a transposed Kalman gain (see Appendix A).
What has just been completed for the first PLSR component can now be iterated, resulting in the next, orthogonal component.Thus in the second step we try to explain the residual El by using the second component in equation ( 13), i.e. by maximizing the sample covariance i 2 T, E 1 /(N-1) = w2T X TE 1/(N-1) under the constraints Vw2 w 2 = 1 and w2 T * 1 = 0.The result now is w2 = C2 Tg 1 (19) where c2 = (ef XX TE 1 ) -1/2 Here is remains to prove that *I * 1 = 0 (taken care of in Appendix B).
After this second step we have W PLS = W2 = [w 1 w2 ], the fitted responses according to equation (8 This contemporary residual, E 2 , is used as input to the third step etc.We thus use more and more detailed factorizations of X (adding new PLSR components) in order to explain consecutive residuals of y.

A new didactic prediction algorithm
The algorithm developed above is as follows (where the scaling in equation ( 23) may be omitted): 1. Set a = 1 and co = y.3. Let a Fa + 1 and go to step 2.
The algorithm thus produces a sequence of predictors b l , b 2 etc.The appropriate number of column vectors * 1 , *2 , ... , wA to use is of critical importance.In general it would be unavoidable to either underfit or overfit this modeling without an unambiguous stopping rule, i.e. the "optimal number of PLSR components" must be decided upon by a suitable evaluation of "the modeling fit".As Höskuldson (1996) has pointed out, this optimization criterion must include terms which reflect both the X-modeling fit as well as prediction error minimization in fact he devised a new compound principle, the H-principle, for nhtaining the optimal balance between these two terms, ibid.
In the chemometric practise, there has been developed a tradition for using an empirical test for finding the "optimal prediction strength" via a suitable validation procedure, which is often of the cross-validation form, e.g.Martens and Naes (1989).But Esbensen (2000) and Esbensen and Huang (2001) have been adamant in pointing out many deficiencies regarding cross-validation, while demonstrating the almost universal need for validation against an independent, so-called "test data set" (the test set validation imperative).A proper validation is an essential part of any multivariate data modeling, not just prediction modeling, ibid.

Remark 6
The new algorithm utilizes residuals of y in order to find consecutive loading weight vectors *a.The Wold and Martens algorithms use residuals of X as well, and this may also be included here.However, this may seen somewhat contrived and confusing when the goal is to explain y in more and more detail by use of the information available in X, and in the present algorithm it is in any case unnecessary.
A very significant part of practical chemometrics is interpretation based on graphical modeling, viz. the score and loading weight matrices etc., Esbensen (2000).These model results must also be computed, either as a part of the algorithm (for the entire series of components calculated), or separately, i.e. for specific values of a (29) (30) (the optimal number of PLS-components).We can find the non-orthogonal score matrix for X as Ta = XWa (27) and the loading vector for y (see Martens and Naes (1989)) as q{a =(Wå XTXWa) -1 * XTy (28)

Conclusion
A combined version of the Martens and Helland PLSR algorithms based on an explicit latent variables model and making use of only loading weight vectors * a is developed from a novel didactic perspective.The resulting predictor is identical to the Helland predictor, and it is furthermore equivalent with the Wold and Martens predictors.Estimates of the non-orthogonal score matrix T and the loading vector q may also be computed as desired.
It is emphasized that we are here exclusively dealing with the so-called nonorthogonal PLSR algorithms, which does not allow a similar insight into the specific PLSR models as can be achieved by using the alternative so-called Wold-models, in which T is indeed orthogonal, as is W (Appendix B).It has been pointed out in chemometrics, that exclusion of this latter information may at times severely cripple the usefulness of the PLSR models with respect to all other aspects than mere prediction, Esbensen (2000).While this may not appear to be of specific importance to the control community, it certainly is in the static chemometric realm.We shall address these aspects in a companion paper.
A connection to Kalman filtering theory is given in Appendix A.