Society for Amateur Scientists

Design Your Experiments Part IX: Regression and Model Building

by Kevin Kilty

Typically when I speak of a model, I mean a mathematical equation of some sort. In Part V of this series I gave the following example of a model...


Y_i = C0 + C1*x_ai + C2*x_bi + C3*x_ai*x_bi + n_i

where the Y's are the outcomes of an experiment; the C's are coefficients which I can determine through the experiment outcomes, and which describe the gain that each factor supplies in controlling the outcome; and n_i is an error or noise component. If this model is a true model, which accurately describes the process I am interested in, then the n's are nothing more than random variables; that is, they are realizations which follow some noise density. However, if the model is not true, then the n's involve wrong-model noise in addition to randomness. I can also write this equation as Y_i = Yp_i + n_i, where Yp_i is a value predicted from the model.

Model building is a process of, first, deciding which coefficients to estimate, then estimating the coefficients from experiments, followed by analyzing, verifying, and perhaps refining these estimates. In Part V I showed a manual method of estimating the coefficients in the model from the results of experiments. In Part VI I showed a more convenient method known as regression. In this installment and the next I plan to explore how a regression analysis employs a method called least squares to solve for coefficient values, and how this information helps with experiment design. Let me begin with matrix arithmetic.

I can write my model in terms of a matrix equation...


        X*p = Y

where: X is a matrix built from the independent variables (factors and levels),
       * is matrix multiplication (here, the inner product of a matrix and a vector),
       p is a column vector containing the unknown coefficients or model parameters, and
       Y is a column vector of experimental outcomes.
The matrix X is sometimes called the design matrix for the reason that it summarizes the experiment design. For the example model I presented above it looks like this...


   | 1   x_a1  x_b1  x_a1*x_b1 |
   | 1   x_a2  x_b2  x_a2*x_b2 |
   | 1   x_a3  x_b3  x_a3*x_b3 |
     ...
   | 1   x_ak  x_bk  x_ak*x_bk |

It contains a column for each factor and interaction in the model and a row for each treatment (combination of factor levels) at which I intend to run an experiment. I have used symbols like x_a1 in illustrating this matrix, but these are really just numbers, because the x's are levels of factors, like +1 or -1, or levels of physical quantities, like 20 degrees C.
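For readers who prefer to check this numerically, here is a minimal sketch of building such a design matrix, assuming Python with numpy and a hypothetical two-factor experiment run at coded levels -1 and +1:

```python
import numpy as np

# Hypothetical 2x2 factorial: factors a and b at coded levels -1 and +1.
xa = np.array([-1.0, -1.0, +1.0, +1.0])
xb = np.array([-1.0, +1.0, -1.0, +1.0])

# One column per model term: intercept, xa, xb, and the interaction xa*xb.
X = np.column_stack([np.ones_like(xa), xa, xb, xa * xb])
print(X)
```

Each row of X corresponds to one planned run, so replicating a run simply repeats its row.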

The vector of coefficients looks like...


   | C0    |
   | C1    |
   | C2    |
     ...
   | C_j-1 |

It contains j unknown quantities. These are the values I hope to find through the regression analysis. And finally, the column vector Y looks like...

   | Y1 |
   | Y2 |
   | Y3 |
     ...
   | Yk |
It contains the k results of the runs and replications of my experiment.

The matrix equation consists of k equations in j unknowns. It is possible that k<j, which means there are fewer equations than unknowns, and I can find an unlimited number of sets of C's which satisfy the matrix equation. This is not always an awkward position to be in, but for analyzing experimental results it certainly is, so I won't consider such a possibility further here.

When the mathematical model is as general as possible for a specific experimental design, then k=j: there are equal numbers of equations and unknowns. In this instance there is possibly a unique, exact solution to the matrix equation. However, every experiment is contaminated with some amount of noise, and with no surplus equations this circumstance provides no means to estimate the experimental noise. The solution to the matrix equation might also be very sensitive to noise; a large condition number (the ratio of the largest to the smallest singular value) for the matrix X indicates this. Another unfortunate possibility is that the experimental design, or lack of one, results in experiment runs providing redundant information. In this case the matrix X is singular, and there is no possible solution to the matrix equation unless the rank of the matrix is reduced to eliminate the redundant information.
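The condition number is easy to examine before running anything. A minimal numpy sketch, with hypothetical designs, contrasting a well-spread factorial against one containing nearly redundant runs:

```python
import numpy as np

# A well-spread 2x2 factorial design matrix (intercept, xa, xb, interaction).
X_good = np.array([[1, -1, -1,  1],
                   [1, -1,  1, -1],
                   [1,  1, -1, -1],
                   [1,  1,  1,  1]], dtype=float)

# A poor design: the last two rows supply almost the same information.
X_bad = np.array([[1, -1, -1,     1],
                  [1, -1,  1,    -1],
                  [1,  1,  1,     1],
                  [1,  1,  1.001, 1.001]], dtype=float)

print(np.linalg.cond(X_good))  # 1.0: an orthogonal design, insensitive to noise
print(np.linalg.cond(X_bad))   # very large: the solution is noise-sensitive
```

numpy's `cond` uses the ratio of largest to smallest singular value, which is the relevant measure here.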

A more typical situation is that the experiment contains more information than needed to solve for just the coefficients of the model: either the model is simpler than the experimental design allows, or there are replications of the experiment. In this case k>j and it is very likely that there is no exact solution to the matrix equation. However, there is possibly a solution which is not exact, but which is more nearly exact than any other in the least squares sense. By the term least squares I mean that this solution comes as close as possible to an exact one by producing the smallest sum of squared differences between the experimental outcomes and the predictions made using the model. There are many ways to find a least squares solution. I'll mention two of them.

The system of normal equations

The typical method of solution, which most spreadsheet programs use, is to multiply the matrix equation on the left by the transpose of the matrix X. A transpose matrix (X^T) is simply one built by interchanging the rows and columns of X. Most spreadsheet applications have a built-in transpose function. The resulting matrix equation is...


        X^T*X*p = X^T*Y

The product X^T*X is just a new matrix, but one which is square: it has j rows and j columns. Being square and (presumably) not singular, this matrix has an exact inverse. People often refer to this new matrix equation as the system of normal equations. The matrix X^T*X provides lots of useful information about a proposed experiment design. For example, it can tell me which factor levels are unusually influential in determining the coefficients. Another example is that a simple transformation of this matrix, known as centering and scaling, indicates how much each coefficient in the model is correlated with the others. It is not possible to find the values of two coefficients which are highly correlated with one another; instead, I can find only some combination of the two. There are three reasons why coefficients become correlated with one another.

  • One possible situation is when factor levels, or data locations, are collinear. Instead of spreading out and covering the response space, the treatments lie along lines, planes, or hyperplanes. This bad distribution of the independent variables not only causes coefficients to become highly dependent on one another, but makes the solution to the matrix equation very susceptible to noise in the data.
  • It is possible to make a model overly complex and not recognize that two parameters are identical or closely related. For example, in the model Y = B*exp(c + d*x), the coefficients B and c combine into a single constant (B*e^c) and are completely dependent on one another.
  • If I were to build a model with an excessive number of polynomial terms, like Y = a + b*x + c*x^2 + ..., then eventually some of the terms are so similar to one another over the range of the data that they lack independence.
Therefore, I can make use of the matrix X^T*X to indicate problems like those on the list, and alter and simplify my model. I'll illustrate this in the next installment, Part X.
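One way to make the centering-and-scaling check concrete: after centering each non-intercept column of X and scaling it to unit length, the product of the transformed matrix with its transpose is a correlation matrix for the factor columns. A minimal numpy sketch, using hypothetical data in which factor b nearly duplicates factor a:

```python
import numpy as np

# Hypothetical collinear treatments: xb is almost exactly 2*xa.
xa = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
xb = 2.0 * xa + np.array([0.0, 0.01, -0.01, 0.01, 0.0])
X = np.column_stack([np.ones_like(xa), xa, xb])

# Center each non-intercept column, then scale it to unit length.
Z = X[:, 1:] - X[:, 1:].mean(axis=0)
Z = Z / np.sqrt((Z ** 2).sum(axis=0))
corr = Z.T @ Z
print(corr)  # off-diagonal entries near 1.0 warn that C1 and C2 cannot be separated
```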

Once I premultiply both sides of the matrix equation by X^T, it takes a form where I can solve it using the inverse of the matrix X^T*X. Once again, most spreadsheet applications have a matrix inverse function. So the solution to the normal equations is...


        (X^T*X)^-1*(X^T*X)*p = (X^T*X)^-1*X^T*Y

or

        p = (X^T*X)^-1*X^T*Y
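A minimal numpy sketch of this solution, using a hypothetical replicated 2x2 factorial and made-up coefficient values (solving the system directly is numerically preferable to forming the explicit inverse, though both give the same answer here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x2 factorial, replicated twice: k=8 runs, j=4 coefficients.
base = np.array([[1, -1, -1,  1],
                 [1, -1,  1, -1],
                 [1,  1, -1, -1],
                 [1,  1,  1,  1]], dtype=float)
X = np.vstack([base, base])

true_p = np.array([10.0, 2.0, -3.0, 0.5])      # made-up "true" coefficients
Y = X @ true_p + 0.1 * rng.standard_normal(8)  # outcomes with noise n_i

# Normal equations: (X^T*X) * p = X^T * Y
p = np.linalg.solve(X.T @ X, X.T @ Y)
print(p)  # close to true_p, off only by the effect of the noise
```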

Singular Value Decomposition

A different method of finding the least squares solution is through singular value decomposition (SVD). The details of this method are very tiresome; a reference which provides a lucid account is Numerical Recipes in C: The Art of Scientific Computing by W. H. Press and others, Cambridge University Press, 1992. However, let me provide a summary.

The matrix (X) possesses an inverse only in the case that k=j, and even in that case only when all rows of the matrix are independent of one another. In the general case where j and k are not equal, X has only a generalized inverse. The SVD algorithm finds this generalized inverse by reducing X to a product of three matrices.

        X = U*W*V^T
The diagonal matrix W contains the singular values of X (their squares are the eigenvalues of the normal-equation matrix X^T*X). I can use it to obtain the condition number of the normal-equation matrix, for example. If I input the normal matrix to an SVD routine, it will provide me not only with the condition number but, in the case of a bad (large) condition number, it will also let me know which rows in the normal matrix are the source of my trouble. Corresponding to each singular value is a column of the matrix V (an eigenvector of X^T*X). These columns are basis vectors for the coefficients of the model; any coefficient value is an evaluated linear combination of them. The matrix product R = V*V^T is called the resolution matrix, and shows how well the data will resolve each coefficient in the model. If R is an identity matrix, the coefficients are all well resolved (R differs from the identity once small singular values are discarded).

The matrix U has to do with the factors, or independent variables, in the experiment design. The matrix product I = U*U^T is known as the information density matrix, and I can use it to decide how much independent information each of my observations supplies. If it were an identity matrix, then each run in the experiment design would supply unique information. I can use this to decide whether to reduce or augment my factor levels.

The left-side generalized inverse of the matrix X is one that makes the matrix product X^-1*X equal to an identity matrix. The matrix product U^T*U is an identity, as is V^T*V. Therefore...

        X^-1 = V*W^-1*U^T

This particular inverse is built to operate on the left side of the matrix equation. A right side generalized inverse is not the same as the left side inverse.
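A minimal numpy sketch of the SVD route, assuming a hypothetical design with k=5 runs and j=4 coefficients (numpy's `svd` returns U, the singular values w, and V^T directly):

```python
import numpy as np

# Hypothetical design: a 2x2 factorial plus one replicated run (k=5, j=4).
X = np.array([[1, -1, -1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1,  1,  1,  1],
              [1,  1,  1,  1]], dtype=float)

# Thin SVD: X = U * diag(w) * V^T, with w the singular values.
U, w, Vt = np.linalg.svd(X, full_matrices=False)

# Left-side generalized inverse: X^-1 = V * W^-1 * U^T.
# In practice, any singular value near zero would be discarded first.
X_inv = Vt.T @ np.diag(1.0 / w) @ U.T

print(np.allclose(X_inv @ X, np.eye(4)))  # True: it is a left inverse of X
```

The least squares solution is then simply `p = X_inv @ Y` for a vector Y of outcomes.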

The advantages of SVD are...

  • It provides an exact solution if one is available.
  • It provides a least squares solution if one is possible.
  • It eliminates basis vectors corresponding to singular values that are zero, and thereby eliminates a coefficient from consideration if there is no information regarding its value.
  • It provides numerical singular values, so that I can eliminate any basis vector whose singular value is so small that it is probably corrupted badly by experimental noise.
  • It provides a coefficient resolution matrix R = V*V^T, which tells me how well resolved the parameters of my model actually are.
  • It provides me an information density matrix I = U*U^T, which I can use to decide where to best obtain experimental data or observations.

The important points for experiment design here are as follows. First, I can compute the resolution matrix R or the normal matrix X^T*X knowing nothing more than the factor levels and interactions in my model. I can then decide whether this distribution of treatments will provide coefficients that are well resolved and independent of one another; if some are poorly resolved, then I can alter my model or my experimental design. I can also find the information density matrix knowing only the factor levels I plan to use. This matrix tells me which treatments possess independent information, and which are most influential.
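A minimal numpy sketch of this pre-experiment check, using a hypothetical (deliberately bad) design in which factors a and b are always set to the same level, with the resolution matrix formed as V*V^T and the information density as U*U^T after discarding zero singular values:

```python
import numpy as np

# Hypothetical bad design: xa and xb always take the same level,
# so their columns in X are identical (redundant treatments).
xa = np.array([-1.0, -1.0, 1.0, 1.0])
xb = xa.copy()
X = np.column_stack([np.ones_like(xa), xa, xb])

U, w, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the singular values that are not numerically zero.
keep = w > 1e-10 * w[0]
Up, Vp = U[:, keep], Vt[keep, :].T

R = Vp @ Vp.T  # coefficient resolution matrix
N = Up @ Up.T  # information density matrix
print(np.round(R, 3))
print(np.round(N, 3))
```

Here R comes out as 1 in the intercept entry but 0.5 in the entries coupling C1 and C2, confirming before any experiment is run that this design can determine only the sum of those two coefficients; N shows each run sharing its information with its duplicate.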

Imagine how important information like this is before I begin performing experiments. If I find out after a run of experiments that I could have obtained more or better information at different factor levels, then I will have to collect that information at some later time, which is not consistent with randomizing all of my experimental trials and makes time an additional experimental factor. The better my information ahead of time, the more efficiently I can perform my experiments.

In the next installment I'll present a few hypothetical design problems to illustrate use of the normal equations and the SVD.
