Design Your Experiments Part VI: Regression

by Kevin Kilty

In the previous installment I explained full factorial designs and advised on a method for analyzing results manually. However, I no longer analyze manually unless I am forced to. A spreadsheet program is far too powerful and convenient to forego in favor of any manual analysis. There are many good reasons for learning to use a spreadsheet program, and for using it often.
In the manual method of design and analysis I scaled all of my factors into levels. For two-level experiments I set the low level at -1 and the high level at +1, so the scaling was 2 units per difference in factor levels. This made my manual analysis very simple. A spreadsheet, however, can handle all the mathematical details, which lets me use natural factor units rather than scaled units. That is much more convenient, and in a second example I'll illustrate using natural units; for now I'll stay with scaled levels of +1 and -1.

An example experiment

Steel has an ultimate strength. If it is made to carry more stress than this it simply fails. My hypothetical experiment involves studying several factors which I believe are related to this. I have noticed, for example, that steel parts seem to fail more often at low temperature than at high temperature, fail less often if they have a finely polished surface, and, quite obviously, fail more often with increased service life. Thus, I plan to test the following factors and levels. I will use the number of machine cycles to measure service life, degrees Fahrenheit to measure temperature, and average roughness (Ra) in microinches to measure surface quality. Please excuse the engineering units. The objective of the experiment (Y) is the tensile stress which causes the steel part to fail--a quantity known as ultimate strength.

Table 1. Factors and levels for experiment

   Factor           Level +1   Level -1   Units
   ------------------------------------------------------------------
   A Service time   10^5       10^6       Cycles
   B Temperature    50         0          Degrees Fahrenheit
   C Surface finish 1-6        12-25      Ra microinches
   ------------------------------------------------------------------

This table shows one advantage of using scaled levels. The surface finish factor is very difficult for me to control precisely; it varies quite a bit over the surface of the part.
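The scaling between natural units and coded levels is just a linear map. Here is a minimal sketch in Python (the function name `code_level` is mine for illustration, not part of any package):

```python
# Map a natural factor value onto the coded -1..+1 scale used in the
# contrasts table: `low` maps to -1, `high` maps to +1.

def code_level(x, low, high):
    """Scale a natural value so that low -> -1 and high -> +1."""
    return 2.0 * (x - low) / (high - low) - 1.0

# Temperature factor from Table 1: level -1 is 0 F, level +1 is 50 F.
print(code_level(0, 0, 50))    # the low level
print(code_level(50, 0, 50))   # the high level
print(code_level(25, 0, 50))   # a center point falls at 0
```

The inverse map recovers natural units from coded levels, which is why nothing is lost by working on the coded scale.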
For purposes of a screening experiment, though, I simply separate my samples into two groups--one with a rough finish and one with a "super" finish. This would be very difficult to input to a typical regression analysis, which would require an exact value for each factor level. I assume for the time being that only two-factor interactions have any importance, so my contrasts table won't include an ABC column for the sole possible three-factor interaction. Also, I plan to replicate some of my experiments by performing the same experiment twice for some treatments, but not all. This illustrates another advantage of regression: the varying number of samples at each level of each factor makes analysis of variance using a three-way table difficult to do. Regression analysis organizes the explanation of variation in the data into a single neat form due to the model, known as variation due to regression, which is (nearly) free from the influence of the number of replications at each treatment. My contrasts table below also summarizes the output data, the results of the experiments, in PSI tensile strength.

Contrasts Table

   Factor Levels and Interactions
   I    A    B    C   AB   AC   BC   Y
   ---------------------------------------------------
   1   -1   -1   -1   +1   +1   +1   73,000; 71,900
   1   +1   -1   -1   -1   -1   +1   106,400
   1   -1   +1   -1   -1   +1   -1   86,270
   1   +1   +1   -1   +1   -1   -1   112,000
   1   -1   -1   +1   +1   -1   -1   91,600; 93,890
   1   +1   -1   +1   -1   +1   -1   114,000
   1   -1   +1   +1   -1   -1   +1   123,000; 118,000
   1   +1   +1   +1   +1   +1   +1   138,000; 135,100
   ---------------------------------------------------

I organize the experiment as follows. I first gather 12 samples of my part and organize them into two groups on the basis of surface finish. I had only 5 with the rough finish, but 7 with the super finish. I then separated each of these groups into two subgroups. One subgroup I assigned to be put through a loading cycle 1 million times, and the other 100,000 times. I made the assignment on the basis of a coin flip.
After all the samples were prepared in this way I decided, again with a coin flip, to assign each sample to be tested for ultimate strength at 50F or at 0F. Now it is time to analyze this data. I use Excel for this, but most spreadsheet programs provide similar capability. If you plan to use Excel, however, you will need to make sure the "Data Analysis Toolkit" is installed. Go to the "Tools" menu item, and if there is no option entitled "Data Analysis" then go to the "Add Ins..." item and check the "Data Analysis Toolkit." Don't bother adding the VBA extension unless you plan to write Visual Basic scripts at some future time. The initial input to the spreadsheet looks almost like the contrasts table, except that there is no need for the 'I' column or the factors column, and there will be 12 rows rather than just 8--one row for each experiment result. The table below shows how the spreadsheet will look...

   Column    A    B    C    D    E    F    G
   row 1     A    B    C   AB   AC   BC    Y
       2    -1   -1   -1    1    1    1   73000
       3    -1   -1   -1    1    1    1   71900
       4     1   -1   -1   -1   -1    1   106400
       5    -1    1   -1   -1    1   -1   86270
       6     1    1   -1    1   -1   -1   112000
       7    -1   -1    1    1   -1   -1   93890
       8    -1   -1    1    1   -1   -1   91600
       9     1   -1    1   -1    1   -1   114000
      10    -1    1    1   -1   -1    1   123000
      11    -1    1    1   -1   -1    1   118000
      12     1    1    1    1    1    1   138000
      13     1    1    1    1    1    1   135100

With the data so organized I now choose the following sequence of options in Excel: Tools->Data Analysis...->Regression. I am now faced with a dialog box on which to make entries. The dependent variable is $G$2:$G$13, and the independent variable(s) $A$2:$F$13. Excel will recognize that I have included 6 contiguous columns as independent variables, and will treat them as six separate variables. Excel has no idea what these mean, of course, but it will keep track of their order. Next I have to check a series of options for output. For example, I asked for residual output, standardized residuals, and graphs of the residuals. I also asked for the output to go to a separate workbook.
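For readers without a spreadsheet handy, the same fit can be done with NumPy's least-squares solver on exactly this 12-row layout. This is a sketch of the calculation Excel performs, not Excel's own code:

```python
# Fit the 12-run screening experiment by ordinary least squares.
import numpy as np

# Columns: A, B, C, AB, AC, BC (coded -1/+1); y is ultimate strength in PSI.
X = np.array([
    [-1, -1, -1,  1,  1,  1],
    [-1, -1, -1,  1,  1,  1],
    [ 1, -1, -1, -1, -1,  1],
    [-1,  1, -1, -1,  1, -1],
    [ 1,  1, -1,  1, -1, -1],
    [-1, -1,  1,  1, -1, -1],
    [-1, -1,  1,  1, -1, -1],
    [ 1, -1,  1, -1,  1, -1],
    [-1,  1,  1, -1, -1,  1],
    [-1,  1,  1, -1, -1,  1],
    [ 1,  1,  1,  1,  1,  1],
    [ 1,  1,  1,  1,  1,  1],
], dtype=float)
y = np.array([73000, 71900, 106400, 86270, 112000, 93890,
              91600, 114000, 123000, 118000, 138000, 135100], dtype=float)

# Prepend a column of ones for the intercept; Excel adds this automatically.
Xd = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
for name, b in zip(["Intercept", "A", "B", "C", "AB", "AC", "BC"], beta):
    print(f"{name:9s} {b:12.4f}")
```

The coefficients it prints agree with the Excel summary output shown next.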
Excel balks at producing graphical output if you don't request this, and I don't understand why. I won't present any of the graphs, but the output which Excel generated follows.

SUMMARY OUTPUT

   Regression Statistics
   Multiple R          0.998009435
   R Square            0.996022832
   Adjusted R Square   0.991250231
   Standard Error      2071.090896
   Observations        12

This first section provides overall measures of how well the model fits the data. In this case R Square, also known as goodness of fit, and Adjusted R Square are all I am interested in. These are unusually high for experimental data, and indicate that the model explains more than 99% of the variation in the data. The next section is the analysis of variance.

ANOVA

   Source       df   SS           MS            F             Significance F
   Regression    6   5371105779   895184296.5   208.6960051   7.81114E-06
   Residual      5   21447087.5   4289417.5
   Total        11   5392552867

Analysis of variance looks at only two classes of variation, and this might be restrictive in some instances, which I will discuss before the next example. The two classes it does consider are the regression, i.e. the model, and residual error. If the model is a good one, then the mean square residual (MS column) is an estimate of the variance from noise in the experiment. The ratio of the mean square of the regression to the mean square residual, the F ratio, measures how significant the model fit is. You'll notice it is extremely significant: the probability of having obtained such a large F (208.69) by chance alone is only 7.8 in a million. The high R Square already foresaw that the model would score as highly significant. The next section of output lists results for each coefficient of the model.
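The ANOVA decomposition is simple enough to reproduce by hand. A sketch, using the observed and predicted values copied from the article's tables (so the sums agree with Excel up to rounding of the printed predictions):

```python
# Split total variation into regression (explained) and residual (noise).
import numpy as np

y = np.array([73000, 71900, 106400, 86270, 112000, 93890,
              91600, 114000, 123000, 118000, 138000, 135100], dtype=float)
yhat = np.array([72701.25, 72701.25, 105897.5, 85767.5, 112502.5, 92493.75,
                 92493.75, 114502.5, 120751.25, 120751.25, 136298.75, 136298.75])

ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares about the mean
ss_resid = np.sum((y - yhat) ** 2)       # residual (error) sum of squares
ss_reg = ss_total - ss_resid             # variation explained by the model

df_reg, df_resid = 6, 12 - 6 - 1         # 6 model terms; intercept uses one df
F = (ss_reg / df_reg) / (ss_resid / df_resid)
r_squared = ss_reg / ss_total
print(F, r_squared)
```

Note that R Square and the F ratio are built from the same two sums of squares, which is why a very high R Square already implies a highly significant F.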
                  Coefficients   Standard Error   t Stat       P-value       Lower 95%     Upper 95%
   Intercept      105114.375     634.1394883      165.759075   1.51617E-10   103484.2702   106744.479
   X Variable 1   12185.9375     625.2699677      19.489081    6.56408E-06   10578.6325    13793.242
   X Variable 2   8715.625       634.1394883      13.744018    3.65942E-05   7085.5202     10345.729
   X Variable 3   10897.1875     625.2699677      17.427972    1.13991E-05   9289.8825     12504.492
   X Variable 4   -1615.3125     625.2699677      -2.583384    0.049229782   -3222.6174    -8.0075049
   X Variable 5   -2796.875      634.1394883      -4.410504    0.006953298   -4426.9797    -1166.7702
   X Variable 6   3797.8125      625.2699677      6.073876     0.001747633   2190.5075     5405.117

X Variable 1 corresponds to the 'A' factor, and X Variable 6 corresponds to the 'BC' interaction. The coefficients themselves are in units of PSI per 2 factor levels. Notice that the standard error repeats across many of the coefficients. The t Stat column gives the calculated Student's t statistic for the null hypothesis that each coefficient is actually zero. In every case except variable 4, the t is larger than 3 in magnitude, which means that all of these coefficients, except possibly the one for the 'AB' interaction, are definitely not zero. For the 'AB' coefficient, the probability of obtaining a larger magnitude of Student's t by random noise alone is 0.049, or 4.9%, which is still quite unlikely. The last two columns provide the 95% confidence interval for each coefficient. For instance, the 95% confidence interval for the coefficient of the 'BC' interaction is between 2190.5 and 5405.1 PSI per factor level squared. The last bit of output from Excel is an analysis of residuals.
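The t Stat and confidence-interval columns are easy to verify yourself. A sketch for the 'A' row, where 2.5706 is the standard two-sided 95% Student's t critical value for the 5 residual degrees of freedom:

```python
# Reproduce one row of the coefficient table from its first two entries.
coef = 12185.9375       # the 'A' coefficient
se = 625.2699677        # its standard error
t_crit = 2.5706         # two-sided 95% t critical value, 5 df

t_stat = coef / se                 # Student's t for "coefficient is zero"
lower = coef - t_crit * se         # Lower 95% bound
upper = coef + t_crit * se         # Upper 95% bound
print(t_stat, lower, upper)
```

The same arithmetic with the other rows' estimates and standard errors reproduces the rest of the table.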
RESIDUAL OUTPUT

   Observation   Predicted Y   Residuals   Standard Residuals
    1            72701.25       298.75      0.213953845
    2            72701.25      -801.25     -0.573826001
    3            105897.5       502.5       0.359872157
    4            85767.5        502.5       0.359872157
    5            112502.5      -502.5      -0.359872157
    6            92493.75       1396.25     0.999943282
    7            92493.75      -893.75     -0.640071125
    8            114502.5      -502.5      -0.359872157
    9            120751.25      2248.75     1.610472662
   10            120751.25     -2751.25    -1.970344819
   11            136298.75      1701.25     1.218373148
   12            136298.75     -1198.75    -0.858500991

This table shows the prediction made using the model and the error of this prediction (the difference between prediction and observation) for each experiment. You'll notice that the worst residual is -2751.25 PSI. This is the magnitude of the largest error I might expect to encounter in using this model to predict tensile strength. Is -2751.25 an unexpected amount of error? The only way to answer this is to examine the column of standard residuals, where the residuals are scaled according to the standard error of prediction. In this column the residuals are turned into random deviates, and so -2751.25 PSI turns out to be a random deviate of -1.97, which is large, but not suspiciously so. If you think of it as a normal random deviate, a residual of this magnitude or larger might be expected something like 5% of the time. This series of experiments has shown me which factors are most in control of my objective. I can do several things with the results. I can...
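The standard-residuals column can be reproduced by dividing each residual by the sample standard deviation of the residuals (least-squares residuals sum to zero, so no mean needs subtracting). A sketch using the residuals from the table above:

```python
# Scale raw residuals into standard residuals, as in Excel's output.
import numpy as np

resid = np.array([298.75, -801.25, 502.5, 502.5, -502.5, 1396.25,
                  -893.75, -502.5, 2248.75, -2751.25, 1701.25, -1198.75])
scale = np.sqrt(np.sum(resid ** 2) / (len(resid) - 1))  # sample std deviation
std_resid = resid / scale
print(std_resid)
```

Printed to full precision, these match the Standard Residuals column, including the -1.97 for observation 10.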
It is going to seem very long-winded to offer a second example, but sometimes people respond better to examples they are interested in, and this one comes from biology. It presents some interesting issues of its own. The problem consists of finding how two factors, storage temperature and storage time, affect the amount of vitamin C in cut beans. The table below shows the raw data, and this time I have decided to use actual units rather than just factor levels. Temperature (F) is taken at three levels, storage time (weeks) at four, and I have added a product column in order to model the interaction between temperature and storage time. The experiment outcome is the amount of vitamin C in the beans as indicated by ascorbic acid concentration in mg/g.

   Temp (F)   Storage (weeks)   Product   Vitamin C (mg/g)
    0          2                  0        45
    0          4                  0        47
    0          6                  0        46
    0          8                  0        46
   10          2                 20        45
   10          4                 40        43
   10          6                 60        41
   10          8                 80        37
   20          2                 40        34
   20          4                 80        28
   20          6                120        21
   20          8                160        16

Obviously material for the experimental units, the beans, came from plots in fields, which occasionally provides unique experiment design problems. For instance, the beans which various fields produce--indeed the beans from various plots within a field, and beans from different seasons--are not comparable with one another. As a result, if an experiment is to run over several seasons and in many different plots, then these are additional factors to consider, and they have to be added as separate columns of their own in the analysis. Experiments done with perennials have to consider the factor of season for this reason. In this case I was lucky enough to find data from a single season and a single plot. I did the analysis using the regression tool available in Excel again. Rather than explain each section of output, I'll just show the entire summary at one time and make a few remarks. Those of you who are interested can take this same data and perform the analysis yourself.
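For those who prefer to check the analysis outside a spreadsheet, here is a sketch of the same regression in natural units, with a product column carrying the temperature-by-time interaction:

```python
# Regression in natural units: intercept, T, W, and the T*W product column.
import numpy as np

temp = np.array([0, 0, 0, 0, 10, 10, 10, 10, 20, 20, 20, 20], dtype=float)
weeks = np.array([2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8], dtype=float)
vit_c = np.array([45, 47, 46, 46, 45, 43, 41, 37, 34, 28, 21, 16], dtype=float)

# Design matrix columns mirror the table: Temp, Storage, Product.
X = np.column_stack([np.ones_like(temp), temp, weeks, temp * weeks])
beta, *_ = np.linalg.lstsq(X, vit_c, rcond=None)
print(beta)   # intercept, temperature, storage-time, interaction coefficients
```

Notice that no coding to -1/+1 was needed: the solver handles natural units directly, which is the convenience promised at the start of the article.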
SUMMARY OUTPUT

   Regression Statistics
   Multiple R          0.956615
   R Square            0.915112
   Adjusted R Square   0.88328
   Standard Error      3.60815
   Observations        12

ANOVA

   Source       df   SS         MS         F          Significance F
   Regression    3   1122.767   374.2556   28.74743   0.000123
   Residual      8   104.15     13.0187
   Total        11   1226.917

                  Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
   Intercept      47.25          4.034035         11.71284   2.58E-06   37.94749    56.55251
   X Variable 1   -0.275         0.312475         -0.88007   0.404482   -0.99557    0.445569
   X Variable 2   0.158          0.736511         0.21497    0.835164   -1.54006    1.856731
   X Variable 3   -0.1575        0.05705          -2.76074   0.024646   -0.28906    -0.02594

RESIDUAL OUTPUT

   Observation   Predicted Y   Residuals   Standard Residuals
    1            47.56667      -2.56667    -0.83413
    2            47.88333      -0.88333    -0.28707
    3            48.2          -2.2        -0.71497
    4            48.51667      -2.51667    -0.81789
    5            41.66667       3.333333    1.08329
    6            38.83333       4.166667    1.35411
    7            36             5           1.62493
    8            33.16667       3.833333    1.24578
    9            35.76667      -1.76667    -0.57414
   10            29.78333      -1.78333    -0.57956
   11            23.8          -2.8        -0.90997
   12            17.81667      -1.81667    -0.59039

This output summary indicates several interesting things. First, the model explains the data quite well. The R Square value suggests that it explains 91% of the variation in the experiments. The regression itself is also quite significant, as the large F ratio shows. While the regression found coefficients different from zero for each term of the model, the small magnitudes of t Stat (less than 1) for X Variable 1 (temperature) and X Variable 2 (storage time) suggest there is no strong evidence that these coefficients are not actually zero. You will notice that zero is included in the 95% confidence intervals for both. However, the interaction term (X Variable 3) is highly significant. From this I would conclude that a reasonably simple model of vitamin C retention with storage time is

   Y = 47.25 - 0.1575*T*W + n

where T is storage temperature in F, W is storage time in weeks, and n is random noise.
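A sketch of this simplified model as a prediction function, using the 47.25 intercept from the coefficient table and dropping the noise term n for prediction:

```python
# Simplified vitamin C model: only the intercept and the T*W interaction
# survive; the separate temperature and time terms were not significant.

def vitamin_c(temp_f, weeks):
    """Predicted ascorbic acid concentration in mg/g."""
    return 47.25 - 0.1575 * temp_f * weeks

print(vitamin_c(0, 8))    # at 0 F the prediction stays at the intercept
print(vitamin_c(20, 8))   # at 20 F for 8 weeks, a substantial loss
```

The interaction-only form captures the key behavior in the data: at 0 F the beans lose essentially nothing over 8 weeks, while warmer storage loses vitamin C faster the longer it lasts.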
The coefficient 47.25 corresponds to just this season's beans and just those grown in this plot, until such time as further experiments show that it applies more widely. In the next installment I'll cover the subject of fractional experimental designs, which help save time and money in expensive programs of experiments.