Society for Amateur Scientists

Design Your Experiments Part X: Model Building

By Kevin Kilty

Editor's Note: Yes, I know that last week we ran Part VII, and I know that X does not normally follow VII. It's what's known in the publishing industry as a screw-up. Mea culpa. We'll run the intervening articles in the coming weeks. Meanwhile, we appreciate your indulgence. -SG

I am not going to solve any regression analyses in this installment, nor build any models using regression, nor discuss any of its complexities and nuances. All of that is more appropriate to a series on data analysis. However, the design matrix (X) involved in building a model by least squares is very useful for design, and I'd like to show how to use it. But first, a few remarks about models.

Transforming models

The best models are always the simplest. Models that become too complex rapidly lose their power to explain. Never include too many terms in a model, or you will soon find yourself fitting some of its coefficients to experimental noise. The example model I use throughout this installment is linear: it contains only factors raised to the first power, and perhaps interactions of these terms. Some people insist that by adding an interaction term such as xy to a model I must also add a term like x^2 to remain consistent. This is bunk. Let the physics of the problem decide what terms to consider, and let the data decide which are significant.

The simplest model to consider is a linear one like...


     Y_i = C0 + C1*x_i

This may appear to be a badly limited model, but, in fact, it is quite useful, because I can transform many a non-linear model into a linear one. For instance, the model which best describes Newton's law of cooling is

     T(t) = D*exp(-kt)

where T is temperature, a function of time (t); D is the initial temperature difference between the body and its surroundings; and k is a time constant related to the heat transfer film at the surface of the body. I make this a linear model by taking the logarithm of both sides.

     Log(T(t)) = Log(D) - kt
I can transform many other equations to a linear form in a similar way. By the way, despite my transform of Newton's law to a linear form, the optimum values of the coefficients which I find through regression now refer to the transformed model, not necessarily to the original. Keep this in mind when returning to the original form.
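To make this concrete, here is a minimal sketch in Python (not part of the original article; the data are synthetic, with D = 50 degrees and k = 0.1 per minute chosen purely for illustration) showing the log-transform fit:

```python
import numpy as np

# Synthetic cooling data; the "true" D = 50 and k = 0.1 are assumed
# values for illustration only.
t = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
T = 50.0 * np.exp(-0.1 * t)

# Fit a straight line to log(T) versus t, since log(T) = log(D) - k*t
slope, intercept = np.polyfit(t, np.log(T), 1)
k_est = -slope            # time constant
D_est = np.exp(intercept) # initial temperature difference
```

On noise-free data this recovers k and D exactly; with real, noisy temperatures, the fitted coefficients are optimal for the transformed (logged) model, as noted above.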

Bias of estimates

In a lab note last summer I described the idea of a differential capacitor. I used this to measure relative permittivity of PVC, but I mentioned at the time that I could also use it to determine the speed of light. This provides an example of using propagation of error to help guide design, and it also provides an example of how to analyze for bias.

The first thing I need to do is analyze the precision required of my measurements. The relationship between permittivity and speed of light is c^2 = 1/(με). The permeability of free space (μ) is defined as 4π×10^-7, so it has no uncertainty at all. Therefore my propagation of error shows that


     u_c^2 = 0.25*(u_ε*c/ε)^2
The speed of light is presently known to a precision of 1 part in 100 million or so. If I hope to measure it with the same relative uncertainty using my method, then I need to measure ε to about half this relative precision. Is it possible to measure ε to this precision? I don't know, but if not, then this experiment will never advance our knowledge at all (see the conclusion below). At this point I would naturally proceed to a propagation of error analysis on obtaining ε from the equation for a capacitor, and one useful thing I would learn from it is that I can make the measured result insensitive to errors in the radii of the inner and outer conductors of the capacitor. However, I think it is more important to present an analysis which I haven't spoken of at all in this series on DoE.

Analyzing for bias is something many people might overlook in their experiment design. What I will do is determine whether the value I calculate as the speed of light using 1/(με)^(1/2) is actually equal to the speed of light. I know this sounds weird, but I need to know if noise is going to add a bias to my calculation. I am going to examine the expected value which will result from my experimental method in the presence of noise. I have to guess at what the noise will be like; I assume only that it has an expected value of zero and a variance that is not zero.

Let δ represent a small difference between the true value of permittivity (ε0) and what I measure in my experiment. I will expect to compute a speed of light (c) equal to...

   E(c) = E((με)^(-1/2)), with ε = ε0 + δ

or, using the linear properties of the expected value...

   E(c) = (με0)^(-1/2) * E((1+δ/ε0)^(-1/2))

or

   E(c) = c * E((1+δ/ε0)^(-1/2))
Therefore, the expected value of my experiment is the speed of light multiplied by some quantity. Is the expected value of this quantity 1? Any reasonable experiment would result in δ being much smaller than ε0 itself. So I can expand the argument of the expected value with a Taylor series or a binomial series through second order, and obtain

   E(c) = c*(E(1) - 0.5*E(δ)/ε0 + 0.375*E(δ^2)/ε0^2)

or, because E(1) = 1, E(δ) = 0, and E(δ^2) is the variance of δ...

   E(c) = c*(1 + 0.375*Variance(δ)/ε0^2)

(The coefficient 0.375 is the 3/8 from the binomial expansion of (1+x)^(-1/2).) From this I see that the expected value of the speed of light that I will obtain from this method is biased above the true value by an amount depending on the variance of δ. However, my propagation of error showed that I needed the relative uncertainty in ε0 to be less than one part in 10^9 anyway. This makes Variance(δ)/ε0^2 about one part in 10^18, which is insignificantly small. So, despite my estimate being biased, the bias is too small to ever notice.
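This bias is easy to verify with a quick Monte Carlo simulation. The sketch below (mine, not the author's) assumes Gaussian noise on the permittivity with a deliberately exaggerated 1% relative standard deviation, so the effect is visible in a modest number of samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu0 = 4.0e-7 * np.pi            # permeability of free space
eps0 = 8.854187817e-12          # permittivity of free space, F/m
c_true = 1.0 / np.sqrt(mu0 * eps0)

# Exaggerated 1% relative noise on the permittivity measurement so the
# tiny bias stands out above the sampling scatter.
rel_sigma = 0.01
delta = rng.normal(0.0, rel_sigma * eps0, size=1_000_000)
c_est = 1.0 / np.sqrt(mu0 * (eps0 + delta))

bias_observed = c_est.mean() / c_true - 1.0
bias_predicted = 0.375 * rel_sigma**2   # (3/8)*Variance(delta)/eps0^2
```

The observed mean comes out slightly above the true speed of light, by very nearly the predicted (3/8)*Variance(δ)/ε0^2.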

How would I set up a model for the differential capacitor? The theoretical capacitance of a pair of concentric cylindrical tubes with length L and ratio of outer to inner radii (a/b) is...



   C = 2πεL/ln(a/b)

However, in reality this is inaccurate because of the fringing of the electric field at the ends of the coaxial tubes and the leads which connect the capacitor to the measuring system. The fringing field and connectors behave as a capacitor placed in parallel with the actual capacitor. Luckily, I can build a series of coaxial capacitors of varying length but with geometrically identical fringing fields. So for any such coaxial capacitor of equal radial geometry I can write...


   C_actual = C_ideal + C_fringe
or
   C_actual = 2πεL/ln(a/b) + C_fringe
or
   C_actual = ε*f(a/b,L) + C_fringe
where f(a/b,L) = 2πL/ln(a/b) is a geometrical factor with dimensions of length. Notice that this is a linear model. I would proceed to build several capacitors with different values of L, and then find the slope of the best-fitting straight line of capacitance against f(a/b,L); that slope is the permittivity.
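That fit might look like the following Python sketch (my illustration, with made-up numbers: five hypothetical tube lengths, a radial ratio a/b = 2, and an assumed 0.4 pF fringing capacitance):

```python
import numpy as np

eps0 = 8.854187817e-12      # F/m, the value the fit should recover
C_fringe = 0.4e-12          # hypothetical fringing capacitance, farads

# Geometric factors f(a/b, L) = 2*pi*L/ln(a/b) for tubes of several
# lengths but identical radial geometry (a/b = 2 assumed).
L_vals = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # meters
f = 2.0 * np.pi * L_vals / np.log(2.0)

# Noise-free "measurements" following C_actual = eps*f + C_fringe
C_meas = eps0 * f + C_fringe

# The slope of the best-fitting line is the permittivity; the
# intercept estimates the fringing capacitance.
slope, intercept = np.polyfit(f, C_meas, 1)
```

Because the fringing term is the same for every length, it lands entirely in the intercept and leaves the slope unbiased.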

What factors should I use in a model?

For the remaining examples I propose to return to Part VI, where I was experimenting with the cold storage of snap beans. In that example I decided to use storage time and storage temperature as factors, and to consider the interaction between them. I have said that I should design experiments knowing as much as possible about something in advance. Does it make sense to include both storage time and temperature as primary factors?

Consider the physical process by which beans lose ascorbic acid. There are two ways this could happen: either the ascorbic acid becomes some other chemical, or it diffuses from the beans. In either case I am looking at a process controlled by a rate. Because rate is the determining factor, I really ought to consider only factors that have units of time in them, to integrate this rate and turn it into an amount. Therefore storage time, and the interaction of storage time and temperature, seem like reasonable factors because they involve time, whereas temperature does not. Temperature does not have the units I expect of a proper factor. In fact, you can think about this another way. Suppose I freeze the beans, hold them briefly at 20F, and then lower them to 0F. Should I record 20F as the factor level for this run? Of course not. It is the temperature over the majority of the storage time that is important.

This makes good sense to me, but back in Part VI, when I did the regression analysis of my snap bean experiments, I found that temperature was the second most significant factor, more significant than storage time, in fact. Why was this so? It turns out that in the experiment design, temperature and the interaction term are very highly correlated with one another. The regression analysis, keep in mind, is simply a stupid calculator; it doesn't understand any physics or chemistry. Thus, any two factors that happen to look alike in the design equation will appear similarly significant in the regression.

Consider the design equation (the matrix X) for the snap bean storage experiment. It is...


    | 1   0   2    0 |
    | 1   0   4    0 |
    | 1   0   6    0 |
    | 1   0   8    0 |
    | 1  10   2   20 |
    | 1  10   4   40 |
    | 1  10   6   60 |
    | 1  10   8   80 |
    | 1  20   2   40 |
    | 1  20   4   80 |
    | 1  20   6  120 |
    | 1  20   8  160 |
The first column of 1s corresponds to the constant in the model; the next three columns correspond to storage temperature, storage time, and their product, respectively. Note that the temperature column and the product column follow the same pattern. This is the source of their correlation. I'll center this design matrix by calculating the average value of each column and subtracting it from each entry in that column. Call the resulting matrix C. It looks like...

    | 0 -10  -3  -50 |
    | 0 -10  -1  -50 |
    | 0 -10   1  -50 |
    | 0 -10   3  -50 |
    | 0   0  -3  -30 |
    | 0   0  -1  -10 |
    | 0   0   1   10 |
    | 0   0   3   30 |
    | 0  10  -3  -10 |
    | 0  10  -1   30 |
    | 0  10   1   70 |
    | 0  10   3  110 |
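The centering step is easy to check numerically. A minimal numpy sketch (mine, not part of the original article):

```python
import numpy as np

# The design matrix X from the text: constant, temperature, time, and
# their product, for temperatures 0/10/20 and storage times 2/4/6/8.
X = np.array([[1, T, t, T * t] for T in (0, 10, 20) for t in (2, 4, 6, 8)],
             dtype=float)

# Subtract each column's mean from that column to get the centered matrix C
C = X - X.mean(axis=0)
```

Every column of C now averages to zero, and its rows match the centered matrix shown above.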
Now I'll form the matrix product C^T*C. Then I'll scale the matrix by dividing each element c_ij by the square root of the product of the two corresponding diagonal terms, sqrt(c_ii*c_jj). This makes the diagonal terms all 1 (yes, even the term that started out zero). The resulting matrix is the correlation matrix (sometimes named R, but this notation is easily confused with the coefficient resolution matrix from SVD). In this case it is...

   | 1   0   0   0 |
   | 0   1   0  .82|
   | 0   0   1  .45|
   | 0  .82 .45  1 |
The correlation coefficient between storage time and the interaction is 0.45, which is not especially large, but the correlation between temperature and the interaction is 0.82, which is significant. There is no way the regression itself can straighten out which of the two highly correlated factors is the more important, and all I can do is remove temperature as a factor in advance because it does not fit the physics of the problem. There is no substitute for knowledge in this case.
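In practice nobody scales C^T*C by hand; numpy's corrcoef produces the same correlation matrix for the three non-constant columns directly (again a sketch of mine, not from the original):

```python
import numpy as np

# Design matrix as before: constant, temperature, time, interaction
X = np.array([[1, T, t, T * t] for T in (0, 10, 20) for t in (2, 4, 6, 8)],
             dtype=float)

# corrcoef treats each row as a variable, so pass the transposed columns;
# the constant column is dropped because it has zero variance.
R = np.corrcoef(X[:, 1:].T)
```

Here R[0, 2] is about 0.82 (temperature vs. the interaction) and R[1, 2] about 0.45 (time vs. the interaction), while temperature and time themselves are uncorrelated by design.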

Using the design matrix or SVD to analyze a design

One thing I'd like to know ahead of time is whether any of the treatments I've chosen to measure is unusually influential. A particularly influential treatment will introduce a large amount of error into the model if its measurement happens to be corrupted by noise or error. The way to measure the influence of a treatment is through what statisticians call the hat matrix, H, defined as H = X*(X^T*X)^-1*X^T. The diagonal terms of this matrix (there will be 12 of them in this case) turn out to be values between 0 and 1, corresponding to my treatments in the order they appear in the design matrix. These are also the values that Excel or any other spreadsheet uses to calculate the "standardized" or "studentized" residuals. A value near 0 indicates the treatment is not influential at all in the analysis, while a value of 1 indicates an infinitely influential one. Statisticians often refer to treatments with values near 1 as having a lot of leverage over the resulting fit of the data. In the case of my example here the values are...


(.58, .25, .25, .58, .23, .10, .10, .23, .58, .25, .25, .58)
The two least influential treatments are those for temperature = 10F and storage times of 4 and 6 weeks. The most influential treatments are the four extreme ones, with (temperature, storage time) equal to (0,2), (0,8), (20,2), and (20,8) respectively. If I am worried about experimental noise affecting my results, and I have limited resources to perform experiments, then I will forego runs and replications at the least influential treatments in favor of additional replications at the most influential ones.

It is interesting to see how SVD would calculate the information density matrix in this case. I used Mathematica to calculate the SVD, but any other similar program will do. The information density matrix (U*U^T, where U comes from the SVD X = U*S*V^T) has 12x12 components, which is too large to be informative in this example. However, the diagonal components of the matrix are...


(.58, .25, .25, .58, .23, .10, .10, .23, .58, .25, .25, .58)
which are the elements of the hat matrix again. Thus, the SVD supplies, in practically one step, the same information that takes me a lot of messing about with the design matrix X.
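Both routes to the leverage values can be sketched in a few lines of numpy (my illustration; the original used Mathematica):

```python
import numpy as np

# Snap bean design matrix: constant, temperature, time, interaction
X = np.array([[1, T, t, T * t] for T in (0, 10, 20) for t in (2, 4, 6, 8)],
             dtype=float)

# Leverage from the hat matrix H = X (X^T X)^-1 X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# The same diagonal from the thin SVD, since H = U U^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
leverage_svd = np.diag(U @ U.T)
```

Rounded to two places, both give (.58, .25, .25, .58, .23, .10, .10, .23, .58, .25, .25, .58), and the diagonal sums to 4, the number of model coefficients.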

Conclusions

There is a huge amount to learn about regression analysis, least squares, and SVD. Only experience will help you learn it. However, I hope that I have shown how much information about experiment design is available through manipulations of the design matrix X.

By the way, in closing I would like to make an interesting observation. My example about using differential capacitors to measure the speed of light appears to be a bit of circular reasoning. At least it shows how easily circular reasoning enters a problem. The present definition of the meter is based on the distance light travels in a short length of time. My example experiment uses a geometric factor f(a/b,L) determined from a standard meter. Thus, I'll be using the speed of light implicitly to find the speed of light. Surely this is a poor situation to be in. Can anyone suggest whether or not the results are meaningful, or are they preordained?
