Design Your Experiments Part VI: Regression

by Kevin Kilty

In the previous installment I explained full factorial designs and advised on a method for analyzing results manually. However, I no longer analyze manually unless I am forced to. A spreadsheet program is far too powerful and convenient to forego in favor of any manual analysis. There are many good reasons for learning to use a spreadsheet program, and for using it often.
In the manual method of design and analysis I scaled all of my factors into levels. For two-level experiments I set the low level at -1 and the high level at +1, so the scaling was 2 units per difference in factor levels. This made my manual analysis very simple. A spreadsheet, however, can handle all the mathematical details, which lets me use natural factor units rather than scaled units. That is much more convenient, and in a second example I'll illustrate using natural units; for now I'll stay with scaled levels of +1 and -1.

An example experiment

Steel has an ultimate strength. If it is made to carry more stress than this it simply fails. My hypothetical experiment involves studying several factors which I believe are related to this. I have noticed, for example, that steel parts seem to fail more often at low temperature than at high temperature, fail less often if they have a finely polished surface, and, quite obviously, fail more often with increased service life. Thus, I plan to test the following factors and levels. I will use the number of machine cycles to measure service life, degrees Fahrenheit to measure temperature, and average roughness (Ra) in microinches to measure surface quality. Please excuse the engineering units. The objective of the experiment (Y) is the tensile stress which causes the steel part to fail--a quantity known as ultimate strength.

Table 1. Factors and levels for experiment

   Factor           Level +1   Level -1   Units
   ------------------------------------------------------------------
   A Service time   10^5       10^6       Cycles
   B Temperature    50         0          Degrees Fahrenheit
   C Surface finish 1-6        12-25      Ra microinches
   ------------------------------------------------------------------

This table shows one advantage of using scaled levels. The surface finish factor is very difficult for me to control precisely; it varies quite a bit over the surface of the part.
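The scaling between natural units and coded levels is just a linear map. Here is a minimal sketch in Python (the function name `code_level` is mine for illustration, not part of any package):

```python
# Map a natural factor value onto the coded -1..+1 scale used in the
# contrasts table: `low` maps to -1, `high` maps to +1.

def code_level(x, low, high):
    """Scale a natural value so that low -> -1 and high -> +1."""
    return 2.0 * (x - low) / (high - low) - 1.0

# Temperature factor from Table 1: level -1 is 0 F, level +1 is 50 F.
print(code_level(0, 0, 50))    # the low level
print(code_level(50, 0, 50))   # the high level
print(code_level(25, 0, 50))   # a center point falls at 0
```

The inverse map recovers natural units from coded levels, which is why nothing is lost by working on the coded scale.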
For purposes of a screening experiment, though, I simply separate my samples into two groups--one with a rough finish and one with a "super" finish. This would be very difficult to input to a typical regression analysis, which would require an exact value for each factor level. I assume for the time being that only two-factor interactions have any importance, so my contrasts table won't include an ABC column for the sole possible three-factor interaction. Also, I plan to replicate some of my experiments by performing the same experiment twice for some treatments, but not all. This illustrates another advantage of regression: the varying number of samples at each level of each factor makes analysis of variance using a three-way table difficult to do. Regression analysis organizes the explanation of variation in the data into a single neat form due to the model, known as variation due to regression, which is (nearly) free from the influence of the number of replications at each treatment. My contrasts table below also summarizes the output data, the results of the experiments, in PSI tensile strength.

Contrasts Table

   Factor Levels and Interactions
   I    A    B    C   AB   AC   BC   Y
   ---------------------------------------------------
   1   -1   -1   -1   +1   +1   +1   73,000; 71,900
   1   +1   -1   -1   -1   -1   +1   106,400
   1   -1   +1   -1   -1   +1   -1   86,270
   1   +1   +1   -1   +1   -1   -1   112,000
   1   -1   -1   +1   +1   -1   -1   91,600; 93,890
   1   +1   -1   +1   -1   +1   -1   114,000
   1   -1   +1   +1   -1   -1   +1   123,000; 118,000
   1   +1   +1   +1   +1   +1   +1   138,000; 135,100
   ---------------------------------------------------

I organize the experiment as follows. I first gather 12 samples of my part and organize them into two groups on the basis of surface finish. I had only 5 with the rough finish, but 7 with the super finish. I then separated each of these groups into two subgroups. One subgroup I assigned to be put through a loading cycle 1 million times, and the other 100,000 times. I made the assignment on the basis of a coin flip.
After all the samples were prepared in this way I decided, again with a coin flip, to assign each sample to be tested for ultimate strength at 50F or at 0F. Now it is time to analyze this data. I use Excel for this, but most spreadsheet programs provide similar capability. If you plan to use Excel, however, you will need to make sure the "Data Analysis Toolkit" is installed. Go to the "Tools" menu item, and if there is no option entitled "Data Analysis" then go to the "Add Ins..." item and check the "Data Analysis Toolkit." Don't bother adding the VBA extension unless you plan to write Visual Basic scripts at some future time. The initial input to the spreadsheet looks almost like the contrasts table, except that there is no need for the 'I' column or the factors column, and there will be 12 rows rather than just 8--one row for each experiment result. The table below shows how the spreadsheet will look...

   Column    A    B    C    D    E    F    G
   row 1     A    B    C   AB   AC   BC    Y
       2    -1   -1   -1    1    1    1   73000
       3    -1   -1   -1    1    1    1   71900
       4     1   -1   -1   -1   -1    1   106400
       5    -1    1   -1   -1    1   -1   86270
       6     1    1   -1    1   -1   -1   112000
       7    -1   -1    1    1   -1   -1   93890
       8    -1   -1    1    1   -1   -1   91600
       9     1   -1    1   -1    1   -1   114000
      10    -1    1    1   -1   -1    1   123000
      11    -1    1    1   -1   -1    1   118000
      12     1    1    1    1    1    1   138000
      13     1    1    1    1    1    1   135100

With the data so organized I now choose the following sequence of options in Excel: Tools->Data Analysis...->Regression. I am now faced with a dialog box on which to make entries. The dependent variable is $G$2:$G$13, and the independent variable(s) $A$2:$F$13. Excel will recognize that I have included 6 contiguous columns as independent variables, and will treat them as six separate variables. Excel has no idea what these mean, of course, but it will keep track of their order. Next I have to check a series of options for output. For example, I asked for residual output, standardized residuals, and graphs of the residuals. I also asked for the output to go to a separate workbook.
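For readers without a spreadsheet handy, the same fit can be done with NumPy's least-squares solver on exactly this 12-row layout. This is a sketch of the calculation Excel performs, not Excel's own code:

```python
# Fit the 12-run screening experiment by ordinary least squares.
import numpy as np

# Columns: A, B, C, AB, AC, BC (coded -1/+1); y is ultimate strength in PSI.
X = np.array([
    [-1, -1, -1,  1,  1,  1],
    [-1, -1, -1,  1,  1,  1],
    [ 1, -1, -1, -1, -1,  1],
    [-1,  1, -1, -1,  1, -1],
    [ 1,  1, -1,  1, -1, -1],
    [-1, -1,  1,  1, -1, -1],
    [-1, -1,  1,  1, -1, -1],
    [ 1, -1,  1, -1,  1, -1],
    [-1,  1,  1, -1, -1,  1],
    [-1,  1,  1, -1, -1,  1],
    [ 1,  1,  1,  1,  1,  1],
    [ 1,  1,  1,  1,  1,  1],
], dtype=float)
y = np.array([73000, 71900, 106400, 86270, 112000, 93890,
              91600, 114000, 123000, 118000, 138000, 135100], dtype=float)

# Prepend a column of ones for the intercept; Excel adds this automatically.
Xd = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
for name, b in zip(["Intercept", "A", "B", "C", "AB", "AC", "BC"], beta):
    print(f"{name:9s} {b:12.4f}")
```

The coefficients it prints agree with the Excel summary output shown next.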
Excel balks at producing graphical output if you don't request this, and I don't understand why. I won't present any of the graphs, but the output which Excel generated follows.

SUMMARY OUTPUT

   Regression Statistics
   Multiple R          0.998009435
   R Square            0.996022832
   Adjusted R Square   0.991250231
   Standard Error      2071.090896
   Observations        12

This first section provides overall measures of how well the model fits the data. In this case R Square, also known as goodness of fit, and Adjusted R Square are all I am interested in. These are unusually high for experimental data, and indicate that the model explains more than 99% of the variation in the data. The next section is the analysis of variance.

ANOVA

   Source       df   SS           MS            F             Significance F
   Regression    6   5371105779   895184296.5   208.6960051   7.81114E-06
   Residual      5   21447087.5   4289417.5
   Total        11   5392552867

Analysis of variance looks at only two classes of variation, and this might be restrictive in some instances, which I will discuss before the next example. The two classes it does consider are the regression, i.e. the model, and residual error. If the model is a good one, then the mean square residual (MS column) is an estimate of the variance from noise in the experiment. The ratio of the mean square of the regression to the mean square residual, the F ratio, measures how significant the model fit is. You'll notice it is extremely significant: the probability of having obtained such a large F (208.69) by chance alone is only 7.8 in a million. The high R Square already foresaw that the model would score as highly significant. The next section of output lists results for each coefficient of the model.
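The ANOVA decomposition is simple enough to reproduce by hand. A sketch, using the observed and predicted values copied from the article's tables (so the sums agree with Excel up to rounding of the printed predictions):

```python
# Split total variation into regression (explained) and residual (noise).
import numpy as np

y = np.array([73000, 71900, 106400, 86270, 112000, 93890,
              91600, 114000, 123000, 118000, 138000, 135100], dtype=float)
yhat = np.array([72701.25, 72701.25, 105897.5, 85767.5, 112502.5, 92493.75,
                 92493.75, 114502.5, 120751.25, 120751.25, 136298.75, 136298.75])

ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares about the mean
ss_resid = np.sum((y - yhat) ** 2)       # residual (error) sum of squares
ss_reg = ss_total - ss_resid             # variation explained by the model

df_reg, df_resid = 6, 12 - 6 - 1         # 6 model terms; intercept uses one df
F = (ss_reg / df_reg) / (ss_resid / df_resid)
r_squared = ss_reg / ss_total
print(F, r_squared)
```

Note that R Square and the F ratio are built from the same two sums of squares, which is why a very high R Square already implies a highly significant F.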
                  Coefficients   Standard Error   t Stat       P-value       Lower 95%     Upper 95%
   Intercept      105114.375     634.1394883      165.759075   1.51617E-10   103484.2702   106744.479
   X Variable 1   12185.9375     625.2699677      19.489081    6.56408E-06   10578.6325    13793.242
   X Variable 2   8715.625       634.1394883      13.744018    3.65942E-05   7085.5202     10345.729
   X Variable 3   10897.1875     625.2699677      17.427972    1.13991E-05   9289.8825     12504.492
   X Variable 4   -1615.3125     625.2699677      -2.583384    0.049229782   -3222.6174    -8.0075049
   X Variable 5   -2796.875      634.1394883      -4.410504    0.006953298   -4426.9797    -1166.7702
   X Variable 6   3797.8125      625.2699677      6.073876     0.001747633   2190.5075     5405.117

X Variable 1 corresponds to the 'A' factor, and X Variable 6 corresponds to the 'BC' interaction. The coefficients themselves are in units of PSI per 2 factor levels. Notice that the standard error repeats across many of the coefficients. The t Stat column gives the calculated Student's t statistic for the null hypothesis that each coefficient is actually zero. In every case except variable 4, the t is larger than 3 in magnitude, which means that all of these coefficients, except possibly the one for the 'AB' interaction, are definitely not zero. For the 'AB' coefficient, the probability of obtaining a larger magnitude of Student's t by random noise alone is 0.049, or 4.9%, which is still quite unlikely. The last two columns provide the 95% confidence interval for each coefficient. For instance, the 95% confidence interval for the coefficient of the 'BC' interaction is between 2190.5 and 5405.1 PSI per factor level squared. The last bit of output from Excel is an analysis of residuals.
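The t Stat and confidence-interval columns are easy to verify yourself. A sketch for the 'A' row, where 2.5706 is the standard two-sided 95% Student's t critical value for the 5 residual degrees of freedom:

```python
# Reproduce one row of the coefficient table from its first two entries.
coef = 12185.9375       # the 'A' coefficient
se = 625.2699677        # its standard error
t_crit = 2.5706         # two-sided 95% t critical value, 5 df

t_stat = coef / se                 # Student's t for "coefficient is zero"
lower = coef - t_crit * se         # Lower 95% bound
upper = coef + t_crit * se         # Upper 95% bound
print(t_stat, lower, upper)
```

The same arithmetic with the other rows' estimates and standard errors reproduces the rest of the table.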
RESIDUAL OUTPUT

   Observation   Predicted Y   Residuals   Standard Residuals
    1            72701.25       298.75      0.213953845
    2            72701.25      -801.25     -0.573826001
    3            105897.5       502.5       0.359872157
    4            85767.5        502.5       0.359872157
    5            112502.5      -502.5      -0.359872157
    6            92493.75       1396.25     0.999943282
    7            92493.75      -893.75     -0.640071125
    8            114502.5      -502.5      -0.359872157
    9            120751.25      2248.75     1.610472662
   10            120751.25     -2751.25    -1.970344819
   11            136298.75      1701.25     1.218373148
   12            136298.75     -1198.75    -0.858500991

This table shows the prediction made using the model and the error of this prediction (the difference between prediction and observation) for each experiment. You'll notice that the worst residual is -2751.25 PSI. This is the magnitude of the largest error I might expect to encounter in using this model to predict tensile strength. Is -2751.25 an unexpected amount of error? The only way to answer this is to examine the column of standard residuals, where the residuals are scaled according to the standard error of prediction. In this column the residuals are turned into random deviates, and so -2751.25 PSI turns out to be a random deviate of -1.97, which is large, but not suspiciously so. If you think of it as a normal random deviate, a residual of this magnitude or larger might be expected something like 5% of the time. This series of experiments has shown me which factors are most in control of my objective. I can do several things with the results. I can...
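The standard-residuals column can be reproduced by dividing each residual by the sample standard deviation of the residuals (least-squares residuals sum to zero, so no mean needs subtracting). A sketch using the residuals from the table above:

```python
# Scale raw residuals into standard residuals, as in Excel's output.
import numpy as np

resid = np.array([298.75, -801.25, 502.5, 502.5, -502.5, 1396.25,
                  -893.75, -502.5, 2248.75, -2751.25, 1701.25, -1198.75])
scale = np.sqrt(np.sum(resid ** 2) / (len(resid) - 1))  # sample std deviation
std_resid = resid / scale
print(std_resid)
```

Printed to full precision, these match the Standard Residuals column, including the -1.97 for observation 10.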
It is going to seem very long-winded to offer a second example, but sometimes people respond better to examples they are interested in, and this one comes from biology. It presents some interesting issues of its own. The problem consists of finding how two factors, storage temperature and storage time, affect the amount of vitamin C in cut beans. The table below shows the raw data, and this time I have decided to use actual units rather than just factor levels. Temperature (F) is taken at three levels, storage time (weeks) at four, and I have added a product column in order to model the interaction between temperature and storage time. The experiment outcome is the amount of vitamin C in the beans as indicated by ascorbic acid concentration in mg/g.

   Temp (F)   Storage (weeks)   Product   Vitamin C (mg/g)
    0          2                  0        45
    0          4                  0        47
    0          6                  0        46
    0          8                  0        46
   10          2                 20        45
   10          4                 40        43
   10          6                 60        41
   10          8                 80        37
   20          2                 40        34
   20          4                 80        28
   20          6                120        21
   20          8                160        16

Obviously material for the experimental units, the beans, came from plots in fields, which occasionally provides unique experiment design problems. For instance, the beans which various fields produce--indeed the beans from various plots within a field, and beans from different seasons--are not comparable with one another. As a result, if an experiment is to run over several seasons and in many different plots, then these are additional factors to consider, and they have to be added as separate columns of their own in the analysis. Experiments done with perennials have to consider the factor of season for this reason. In this case I was lucky enough to find data from a single season and a single plot. I did the analysis using the regression tool available in Excel again. Rather than explain each section of output, I'll just show the entire summary at one time and make a few remarks. Those of you who are interested can take this same data and perform the analysis yourself.
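For those who prefer to check the analysis outside a spreadsheet, here is a sketch of the same regression in natural units, with a product column carrying the temperature-by-time interaction:

```python
# Regression in natural units: intercept, T, W, and the T*W product column.
import numpy as np

temp = np.array([0, 0, 0, 0, 10, 10, 10, 10, 20, 20, 20, 20], dtype=float)
weeks = np.array([2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8], dtype=float)
vit_c = np.array([45, 47, 46, 46, 45, 43, 41, 37, 34, 28, 21, 16], dtype=float)

# Design matrix columns mirror the table: Temp, Storage, Product.
X = np.column_stack([np.ones_like(temp), temp, weeks, temp * weeks])
beta, *_ = np.linalg.lstsq(X, vit_c, rcond=None)
print(beta)   # intercept, temperature, storage-time, interaction coefficients
```

Notice that no coding to -1/+1 was needed: the solver handles natural units directly, which is the convenience promised at the start of the article.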
SUMMARY OUTPUT

   Regression Statistics
   Multiple R          0.956615
   R Square            0.915112
   Adjusted R Square   0.88328
   Standard Error      3.60815
   Observations        12

ANOVA

   Source       df   SS         MS         F          Significance F
   Regression    3   1122.767   374.2556   28.74743   0.000123
   Residual      8   104.15     13.0187
   Total        11   1226.917

                  Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
   Intercept      47.25          4.034035         11.71284   2.58E-06   37.94749    56.55251
   X Variable 1   -0.275         0.312475         -0.88007   0.404482   -0.99557    0.445569
   X Variable 2   0.158          0.736511         0.21497    0.835164   -1.54006    1.856731
   X Variable 3   -0.1575        0.05705          -2.76074   0.024646   -0.28906    -0.02594

RESIDUAL OUTPUT

   Observation   Predicted Y   Residuals   Standard Residuals
    1            47.56667      -2.56667    -0.83413
    2            47.88333      -0.88333    -0.28707
    3            48.2          -2.2        -0.71497
    4            48.51667      -2.51667    -0.81789
    5            41.66667       3.333333    1.08329
    6            38.83333       4.166667    1.35411
    7            36             5           1.62493
    8            33.16667       3.833333    1.24578
    9            35.76667      -1.76667    -0.57414
   10            29.78333      -1.78333    -0.57956
   11            23.8          -2.8        -0.90997
   12            17.81667      -1.81667    -0.59039

This output summary indicates several interesting things. First, the model explains the data quite well. The R Square value suggests that it explains 91% of the variation in the experiments. The regression itself is also quite significant, as the large F ratio shows. While the regression found coefficients different from zero for each term of the model, the small magnitudes of t Stat (less than 1) for X Variable 1 (temperature) and X Variable 2 (storage time) suggest there is no strong evidence that these coefficients are not actually zero. You will notice that zero is included in the 95% confidence intervals for both. However, the interaction term (X Variable 3) is highly significant. From this I would conclude that a reasonably simple model of vitamin C retention with storage time is

   Y = 47.25 - 0.1575*T*W + n

where T is storage temperature in F, W is storage time in weeks, and n is random noise.
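A sketch of this simplified model as a prediction function, using the 47.25 intercept from the coefficient table and dropping the noise term n for prediction:

```python
# Simplified vitamin C model: only the intercept and the T*W interaction
# survive; the separate temperature and time terms were not significant.

def vitamin_c(temp_f, weeks):
    """Predicted ascorbic acid concentration in mg/g."""
    return 47.25 - 0.1575 * temp_f * weeks

print(vitamin_c(0, 8))    # at 0 F the prediction stays at the intercept
print(vitamin_c(20, 8))   # at 20 F for 8 weeks, a substantial loss
```

The interaction-only form captures the key behavior in the data: at 0 F the beans lose essentially nothing over 8 weeks, while warmer storage loses vitamin C faster the longer it lasts.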
The coefficient 47.25 corresponds to just this season's beans and just those grown in this plot, until such time as further experiments show that it applies more widely. In the next installment I'll cover the subject of fractional experimental designs, which help save time and money in expensive programs of experiments.