Analysis of Variance
by Kevin Kilty

Recently I introduced the concept of Analysis of Variance to students in an Engineering Metrology course, but this powerful statistical method is also a valuable part of any scientist's data analysis toolkit. It is especially useful in the biological sciences.

The idea behind analysis of variance is quite simple. If a large body of data represents samples obtained from a single population, then various ways of classifying the data constitute merely different ways of measuring the same population characteristic--namely the population variance. If, on the other hand, the classification of the data actually divides the data into samples from different populations, then that should be apparent through a comparison of the variance of each.

It is easier to refer to an example than to try to explain the process in the abstract. The table below summarizes data from an experiment involving corms, which are the propagating part of gladiola radicals. The gladiolas which grew from these corms produced varying numbers of florets, and the purpose of the experiment was to determine whether the corms found on the high part of the gladiola bulb, which are first-year corms, produce fewer florets than low corms, which are two years old and presumably more mature. The experimenter planted 10 high corms and 10 low corms in each of 7 different plots of soil. The idea behind planting a mixture of high and low corms in each plot is to remove variation caused by different soil properties. Statisticians call this "paired data." The data displayed in the table are the mean number of florets from these pairs of 10 plants in each plot. The rows list data from the different plots.
Experimental Data: Floret means per Gladiola

Plot      High    Low     Totals
1         11.2    14.6    25.8
2         13.3    12.6    25.9
3         12.8    15.0    27.8
4         13.7    15.6    29.3
5         12.2    12.7    24.9
6         11.9    12.0    23.9
7         12.1    13.1    25.2
Totals    87.2    95.6    182.8

Sum of squares of all data     2407.9
Mean square of total           2386.85
Mean sum of corms squares      2391.89
Mean sum of plots squares      2397.02

In statistical inference we think of a pair of hypotheses to test with these data. The first hypothesis is that all 14 of these measurements are just samples from a single population. This implies that the classification into high and low corms is inconsequential: the variation from one plot to another is all experimental variation within this single population, and it does not matter whether the parent corm is high or low. This is my null hypothesis. The alternative hypothesis is that the proposed classification is significant, and generally we decide upon a level of significance for the test of these hypotheses at this point. For example, we may decide that a 10% level of significance is sufficient in this case because we are already convinced that low corms are more mature and likely to be more efflorescent. In case we were less sure of our alternative, we might choose a lower level of significance--5% or 1%--to provide a more convincing demonstration. I'll leave this issue of significance level open for the present time.

I have calculated the sum of
the squares of all the data. This is 2407.9. I have also computed the
mean of the squared sum 2386.85. The difference of these is the sum of
squares computed about the mean value (this is a computational short-cut).
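This short-cut can be checked numerically. Here is a minimal sketch in Python (the sample values are arbitrary; any small sample works):

```python
# Check of the computational short-cut: the sum of squares about the mean
# equals the raw sum of squares minus n times the squared mean.
xs = [11.2, 14.6, 13.3, 12.6]   # arbitrary sample values
n = len(xs)
mean = sum(xs) / n

direct = sum((x - mean) ** 2 for x in xs)        # from the definition
shortcut = sum(x * x for x in xs) - n * mean**2  # the short-cut form

print(abs(direct - shortcut) < 1e-9)  # True: the two forms agree
```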
If I divide this by the number of data values minus 1 (13), then I have
computed the total variance of my 14 experiments. This is one estimate
of population variance. Symbolically, the population variance, s², is approximately

    s² ≈ (ΣX² − (ΣX)²/n) / (n − 1) = (2407.9 − 2386.85) / 13 = 21.05 / 13 ≈ 1.62

But there are other estimates
of population variance available in this data. For example, the columns
are divided into classes of high and low corms. If all of the data come
from a single population (that is, our null hypothesis is true) then this
division is inconsequential and the mean squared difference between the
two column totals about their mean is also an estimate of the same population
variance, s². In this case

    s² ≈ [(87.2² + 95.6²)/7 − 182.8²/14] / (2 − 1) = 2391.89 − 2386.85 = 5.04

Following this paragraph I show an analysis of variance in its usual form. The first row summarizes the total variation including all data: first the sum of squares about the mean of the data, then the number of degrees of freedom (n − 1 = 13), and the quotient of the two, the mean square or sample variance. Degrees of freedom simply refers to the number of independent pieces of information in the data. There is one less independent piece of information because we have made use of the mean value, which prevents the 14th observation from being independent of the other 13. The second row is the sum of squares calculated from the column totals, then its degrees of freedom (m − 1 = 1), followed by the mean square. Finally, the third row begins by taking the difference of the sums of squares of the two previous rows. This is the residual sum of squares. Any fraction of the total sum of squares which is left unexplained by our classification into high and low corms must reside in the residual sum of squares. Likewise, the degrees of freedom in the residual sum of squares is that in the total data less that in our classification.

Analysis of Variance

Variation    Sum of Squares    Degrees of Freedom    Mean Square    F
Total        21.05             13                    1.62
Corms         5.04              1                    5.04           3.78 (0.10)
Residual     16.01             12                    1.33

By our null hypothesis the mean square of the classification should approximately equal the mean square of the residual; they measure the same population variance in this case, after all. The ratio between them should be 1. In reality, however, the ratio is 3.78, which seems quite different from 1. Is this difference significant? An answer to this question is available through the F statistic. The variance ratio of separate samples drawn from a single population is distributed according to the F statistic. Most spreadsheets provide an F distribution function, but otherwise you can find it in many tables and tutorials on the internet.
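The one-way table above can be reproduced in a few lines of Python. This is a sketch specific to the floret means in the data table, not a general-purpose ANOVA routine:

```python
# One-way analysis of variance for the gladiola floret data
high = [11.2, 13.3, 12.8, 13.7, 12.2, 11.9, 12.1]
low  = [14.6, 12.6, 15.0, 15.6, 12.7, 12.0, 13.1]

data = high + low
n = len(data)                        # 14 observations
correction = sum(data) ** 2 / n      # mean square of total, 2386.85

ss_total = sum(x * x for x in data) - correction              # about 21.05
ss_corms = (sum(high) ** 2 + sum(low) ** 2) / 7 - correction  # about 5.04
ss_resid = ss_total - ss_corms                                # about 16.01

ms_corms = ss_corms / 1   # classification: m - 1 = 1 degree of freedom
ms_resid = ss_resid / 12  # residual: 13 - 1 = 12 degrees of freedom
F = ms_corms / ms_resid

print(round(ss_total, 2), round(ss_corms, 2), round(F, 2))  # 21.05 5.04 3.78
```

A statistics library (for example, scipy.stats.f.sf(F, 1, 12)) or a spreadsheet's F-distribution function would then give the tail probability for this variance ratio.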
The last column shows that this value of F would occur only about 10% (0.10) of the time in ratios of variances of samples drawn from a single population. This seems sufficiently unlikely that I conclude that data derived from the high corms differ from data derived from the low corms. I have reason to reject my null hypothesis and accept its alternative.

I have produced an expanded analysis of variance, below, to illustrate a two-way classification. The various plots of soil used in the experiment provide a second classification of the data. Soils often have varying fertility and so introduce a source of experimental variation, or error. This was the motivation for pairing the data in the experiment. After adding this second classification to my analysis of variance, I follow the same course of calculation that I did before.

Analysis of Variance (2-way Classification)

Variation    Sum of Squares    Degrees of Freedom    Mean Square    F
Total        21.05             13                    1.62
Plots        10.17              6                    1.70           1.74 (0.26)
Corms         5.04              1                    5.04           5.18 (0.06)
Residual      5.84              6                    0.97

Notice in the resulting analysis
that the mean square from the plots is nearly twice that in the residual.
Even though its F-ratio (1.74) is not especially significant (0.26, or 26%), this classification nevertheless explains nearly half the variation in the raw data (a sum of squares of 10.17 vs. 21.05). By including this second classification of the data I managed to reduce the residual mean square substantially,
and this in turn has made the F-ratio (5.18) for the corms even more significant
(6%). By accounting for all identifiable sources of variation, and making
this analysis of variance, I have shown there is significant reason to
reject the null hypothesis, and I have also found a consistent and defensible
estimate of my experimental error--0.97 florets squared per plant, or
a standard deviation of about 1 floret per gladiola.
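The two-way table can be reproduced the same way. This sketch adds the plots classification to the one-way computation (again specific to this data set):

```python
# Two-way analysis of variance: corms and plots classifications
high = [11.2, 13.3, 12.8, 13.7, 12.2, 11.9, 12.1]
low  = [14.6, 12.6, 15.0, 15.6, 12.7, 12.0, 13.1]

n = 14
correction = (sum(high) + sum(low)) ** 2 / n   # 182.8 squared over 14

ss_total = sum(x * x for x in high + low) - correction                    # 21.05
ss_corms = (sum(high) ** 2 + sum(low) ** 2) / 7 - correction              # 5.04
ss_plots = sum((h + l) ** 2 for h, l in zip(high, low)) / 2 - correction  # 10.17
ss_resid = ss_total - ss_plots - ss_corms                                 # 5.84

ms_resid = ss_resid / 6               # 13 - 6 - 1 = 6 degrees of freedom
F_plots = (ss_plots / 6) / ms_resid
F_corms = (ss_corms / 1) / ms_resid

print(round(ss_plots, 2), round(F_plots, 2), round(F_corms, 2))  # 10.17 1.74 5.18
```

The residual mean square, ss_resid / 6, is about 0.97, the experimental-error estimate quoted above.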