PROPHET StatGuide: Do your data violate one-way ANOVA assumptions?


If the populations from which the data for a one-way analysis of variance (ANOVA) were sampled violate one or more of the test's assumptions, the results of the analysis may be incorrect or misleading. For example, if the assumption of independence is violated, then the one-way ANOVA is simply not appropriate, although another test (perhaps a blocked one-way ANOVA) may be appropriate. If the assumption of normality is violated, or outliers are present, then the one-way ANOVA may not be the most powerful test available, and this could mean the difference between detecting a true difference among the population means or not. Using a nonparametric test or a transformation may result in a more powerful test. A potentially more damaging assumption violation occurs when the population variances are unequal, especially if the sample sizes are not approximately equal (unbalanced). Often, the effect of an assumption violation on the one-way ANOVA result depends on the extent of the violation (such as how unequal the population variances are, or how heavy-tailed one or another population distribution is). Some small violations may have little practical effect on the analysis, while other violations may render the one-way ANOVA result incorrect or uninterpretable to the point of being useless. In particular, small or unbalanced sample sizes can increase vulnerability to assumption violations.

Potential assumption violations include:


Implicit factors:
A lack of independence within a sample is often caused by the existence of an implicit factor in the data. For example, values collected over time may be serially correlated (here time is the implicit factor). If the data are in a particular order, consider the possibility of dependence. (If the row order of the data reflects the order in which the data were collected, an index plot of the data [data value plotted against row number] can reveal patterns that suggest possible time effects.)
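As a numerical companion to the index plot, the lag-1 autocorrelation of the values in collection order gives a rough indication of serial dependence. The sketch below is illustrative only: the function name and the two example series are made up, and a value far from zero is a hint to investigate, not a formal test.

```python
from statistics import fmean

def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation of values listed in collection order."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("values are constant")
    # Sum of products of each deviation with the next one in sequence
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom

# Hypothetical data: one series drifts upward over time, one alternates
drifting = [1.0, 1.2, 1.1, 1.5, 1.7, 1.6, 2.0, 2.2, 2.1, 2.5]
alternating = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]

print(lag1_autocorrelation(drifting))     # clearly positive: neighbors are alike
print(lag1_autocorrelation(alternating))  # clearly negative: neighbors differ
```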

Lack of independence:
Whether the samples are independent of each other is generally determined by the structure of the experiment from which they arise. Obviously correlated samples, such as a set of observations over time on the same subjects, are not independent, and such data would be more appropriately tested by a one-way blocked ANOVA or a repeated measures ANOVA. If you are unsure whether your samples are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.

Outliers:
Values may not be identically distributed because of the presence of outliers. Outliers are anomalous values in the data. Outliers tend to increase the estimate of sample variance, thus decreasing the calculated F statistic for the ANOVA and lowering the chance of rejecting the null hypothesis. They may be due to recording errors, which may be correctable, or they may be due to the sample not being entirely from the same population. Apparent outliers may also be due to the values being from the same, but nonnormal, population. The boxplot and normal probability plot (normal Q-Q plot) may suggest the presence of outliers in the data.

The F statistic is based on the sample means and the sample variances, each of which is sensitive to outliers. (In other words, neither the sample mean nor the sample variance is resistant to outliers, and thus, neither is the F statistic.) In particular, a large outlier can inflate the overall variance, decreasing the F statistic and thus perhaps eliminating a significant difference. A nonparametric test may be a more powerful test in such a situation. If you find outliers in your data that are not due to correctable errors, you may wish to consult a statistician as to how to proceed.
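The effect of a single large outlier on the F statistic can be seen by computing F directly from its definition. The following is a minimal sketch with made-up data; the function name one_way_f is an assumption for illustration and is not part of Prophet.

```python
from statistics import fmean

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    grand_mean = sum(len(g) * m for g, m in zip(groups, means)) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical samples with clearly separated means
clean = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
# Same data, but one value in the first sample is replaced by a large outlier
with_outlier = [[4, 5, 26], [7, 8, 9], [10, 11, 12]]

print(one_way_f(clean))         # large F: the means are well separated
print(one_way_f(with_outlier))  # much smaller F: the outlier inflates the within-group variance
```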

Nonnormality:
The values in a sample may indeed be from the same population, but not from a normal one. Signs of nonnormality are skewness (lack of symmetry) or light-tailedness or heavy-tailedness. The boxplot, histogram, and normal probability plot (normal Q-Q plot), along with the normality test, can provide information on the normality of the population distribution. However, if there are only a small number of data points, nonnormality can be hard to detect. If there are a great many data points, the normality test may detect statistically significant but trivial departures from normality that will have no real effect on the F statistic.

For data sampled from a normal distribution, normal probability plots should approximate straight lines, and boxplots should be symmetric (median and mean together, in the middle of the box) with no outliers.

The one-way ANOVA's F test will not be much affected even if the population distributions are skewed, but the F test can be sensitive to population skewness if the sample sizes are seriously unbalanced. If the sample sizes are not unbalanced, the F test will not be seriously affected by light-tailedness or heavy-tailedness, unless the sample sizes are small (less than 5), or the departure from normality is extreme (kurtosis less than -1 or greater than 2).
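Sample skewness and kurtosis can be computed from the central moments of each sample. The sketch below uses the simple moment estimators and assumes that "kurtosis" above means excess kurtosis (zero for a normal distribution); the function names are hypothetical. Estimates from small samples are very noisy, so they should be read alongside the plots rather than in place of them.

```python
from statistics import fmean

def moment_shape(values):
    """Return (skewness, excess kurtosis) using simple moment estimators."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = fmean(values)
    devs = [v - mean for v in values]
    m2 = sum(d ** 2 for d in devs) / n
    m3 = sum(d ** 3 for d in devs) / n
    m4 = sum(d ** 4 for d in devs) / n
    if m2 == 0:
        raise ValueError("values are constant")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def kurtosis_is_extreme(excess_kurtosis):
    """Apply the rough guideline above: below -1 or above 2 is extreme."""
    return excess_kurtosis < -1.0 or excess_kurtosis > 2.0

# Hypothetical samples: one symmetric and flat, one with a long right tail
flat = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]

for sample in (flat, right_skewed):
    skew, kurt = moment_shape(sample)
    print(sample, skew, kurt, kurtosis_is_extreme(kurt))
```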

Robust statistical tests operate well across a wide variety of distributions. A test can be robust for validity, meaning that it provides P values close to the true ones in the presence of (slight) departures from its assumptions. It may also be robust for efficiency, meaning that it maintains its statistical power (the probability that a true violation of the null hypothesis will be detected by the test) in the presence of those departures. The one-way ANOVA's F test is robust for validity against nonnormality, but it may not be the most powerful test available for a given nonnormal distribution, although it is the most powerful test available when its test assumptions are met. In the case of nonnormality, a nonparametric test or employing a transformation may result in a more powerful test.

Unequal population variances:
The inequality of the population variances can be assessed by examining the relative sizes of the sample variances, either informally (including graphically), or by a robust variance test such as Levene's test. (Bartlett's test is even more sensitive to nonnormality than the one-way ANOVA's F test, and thus should not be used for such testing.) The effect of inequality of variances is mitigated when the sample sizes are equal: the F test is then fairly robust against inequality of variances, although the chance of incorrectly reporting a significant difference in the means when none exists still increases. This chance of incorrectly rejecting the null hypothesis is greater when the population variances are very different from each other, particularly if there is one sample variance very much larger than the others.
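Levene's test statistic is the one-way ANOVA F statistic computed on the absolute deviations of each value from its own sample's center. The sketch below centers on the median (the Brown-Forsythe variant), which is an assumption made here and not necessarily the variant Prophet uses; the function names and data are illustrative. The statistic would be referred to an F distribution with k - 1 and N - k degrees of freedom, which is not computed here.

```python
from statistics import fmean, median

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    grand_mean = sum(len(g) * m for g, m in zip(groups, means)) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def levene_statistic(groups, center=median):
    """Levene-type statistic: ANOVA F on absolute deviations from each group's center."""
    abs_devs = []
    for g in groups:
        c = center(g)
        abs_devs.append([abs(v - c) for v in g])
    return one_way_f(abs_devs)

# Hypothetical samples: different means but equal spread
equal_spread = [[1, 2, 3], [11, 12, 13], [21, 22, 23]]
# Hypothetical samples: equal means but very different spread
unequal_spread = [[9, 10, 11], [0, 10, 20]]

print(levene_statistic(equal_spread))    # zero: the spreads are identical
print(levene_statistic(unequal_spread))  # larger: the spreads differ
```

Passing center=fmean instead of the default gives the original mean-centered form of the statistic.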

The effect of inequality of the variances is most severe when the sample sizes are unequal. If the larger samples are associated with the populations with the larger variances, then the F statistic will tend to be smaller than it should be, reducing the chance that the test will correctly identify a significant difference between the means (i.e., making the test conservative). On the other hand, if the smaller samples are associated with the populations with the larger variances, then the F statistic will tend to be greater than it should be, increasing the risk of incorrectly reporting a significant difference in the means when none exists. This chance of incorrectly rejecting the null hypothesis in the case of unbalanced sample sizes can be substantial even when the population variances are not very different from each other.
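The two directions of this effect can be illustrated by simulation. The sketch below draws normal samples with equal population means, so every rejection is a false positive. To stay self-contained it estimates the 5% critical value empirically from equal-variance simulations rather than taking it from an F table. The sample sizes, standard deviations, seed, and number of simulations are arbitrary choices made for illustration.

```python
import random
from statistics import fmean

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    grand_mean = sum(len(g) * m for g, m in zip(groups, means)) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def simulate_f(sizes, sds, n_sims, rng):
    """F statistics for normal samples that all share the same population mean."""
    stats = []
    for _ in range(n_sims):
        groups = [[rng.gauss(0.0, sd) for _ in range(n)] for n, sd in zip(sizes, sds)]
        stats.append(one_way_f(groups))
    return stats

def rejection_rate(stats, critical):
    """Fraction of simulated F statistics that exceed the critical value."""
    return sum(f > critical for f in stats) / len(stats)

rng = random.Random(1997)   # fixed seed so the run is repeatable
sizes = (5, 10, 30)         # unbalanced sample sizes
n_sims = 4000

# Empirical 5% critical value when all assumptions hold (equal variances)
null_stats = sorted(simulate_f(sizes, (1, 1, 1), n_sims, rng))
critical = null_stats[int(0.95 * n_sims)]

# Smallest sample has the largest variance: false positives become more frequent
liberal = rejection_rate(simulate_f(sizes, (3, 1, 1), n_sims, rng), critical)
# Largest sample has the largest variance: the test becomes conservative
conservative = rejection_rate(simulate_f(sizes, (1, 1, 3), n_sims, rng), critical)

print(critical, liberal, conservative)
```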

Although the effect of unbalanced sample sizes and unequal population variances increases for smaller sample sizes, it does not decrease substantially if the sample sizes are increased without changing the lack of balance in the sample sizes. For this reason, and because equal sample sizes mitigate the effect of unequal population variances, the best course is to keep the sample sizes as equal as possible.

If both nonnormality and unequal variances are present, employing a transformation may be preferable. A nonparametric test like the Kruskal-Wallis test still assumes that the population variances are comparable.
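The Kruskal-Wallis statistic is computed from the ranks of the pooled data rather than from the values themselves, which is why the size of an extreme value does not affect it. The following is a minimal sketch with made-up data and hypothetical function names; it assigns average ranks to tied values but omits the usual correction for ties. Working on ranks does not remove the need for comparable population variances.

```python
def midranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical samples; in the second set the largest value becomes an extreme outlier
clean = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
with_outlier = [[1, 2, 3], [4, 5, 6], [7, 8, 900]]

print(kruskal_wallis_h(clean))
print(kruskal_wallis_h(with_outlier))  # same value: only the ranks matter
```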

Patterns in plot of data:
The plot of each sample's values against its mean (or its sample ID) will consist of vertical "stacks" of data points, one stack for each unique sample mean value. If the assumptions for the samples' population distributions are correct, the stacks should be about the same length. Outliers may appear as anomalous points in the graph.

A fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture (one or more of the "stacks" of data points is much longer than the others), suggests that the variance in the values increases in the direction the fan pattern widens (usually as the sample mean increases), and this in turn suggests that a transformation may be needed.
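When the spread of the values grows in proportion to the sample mean, a logarithmic transformation is one common choice for stabilizing the variance. The sketch below uses made-up data constructed so that each sample is a scaled copy of the same base values; real data will rarely behave this cleanly, and another transformation may suit other patterns better.

```python
import math
from statistics import stdev

# Hypothetical base values, scaled so that spread grows with the mean
base = [0.8, 0.9, 1.0, 1.1, 1.25]
groups = {scale: [scale * b for b in base] for scale in (1, 10, 100)}

# Standard deviations on the raw scale: these fan out with the mean
raw_sds = {scale: stdev(g) for scale, g in groups.items()}

# Standard deviations after a log transformation: these are the same for every sample
log_sds = {scale: stdev([math.log(v) for v in g]) for scale, g in groups.items()}

for scale in groups:
    print(scale, raw_sds[scale], log_sds[scale])
```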

Side-by-side boxplots of the samples can also reveal lack of homogeneity of variances if some boxplots are much longer than others, and reveal suspected outliers.

Special problems with small sample sizes:
If one or more of the sample sizes is small, assumption violations such as nonnormality or inequality of variances are difficult to detect even when they are present. Also, with small sample size(s) the one-way ANOVA's F test offers less protection against violation of assumptions.

Even if none of the test assumptions are violated, a one-way ANOVA with small sample sizes may not have sufficient power to detect any significant difference among the samples, even if the means are in fact different. The power depends on the error variance, the selected significance (alpha-) level of the test, and the sample size. Power decreases as the variance increases, decreases as the significance level is decreased (i.e., as the test is made more stringent), and increases as the sample size increases. With very small samples, even samples from populations with very different means may not produce a significant one-way ANOVA F test statistic unless the sample variance is small. If a statistical significance test with small sample sizes produces a surprisingly non-significant P value, then a lack of power may be the reason. The best time to avoid such problems is in the design stage of an experiment, when appropriate minimum sample sizes can be determined, perhaps in consultation with a statistician, before data collection begins.
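The way power depends on sample size and error variance can be summarized by the noncentrality parameter of the F test, the sum over samples of n_i * (mu_i - mu_bar)^2, divided by sigma^2, where mu_bar is the sample-size-weighted mean of the population means. A larger value means more power at a given significance level. The sketch below computes only this parameter; converting it to an actual power value requires the noncentral F distribution, which is not shown. The function name and the planning values are hypothetical.

```python
def noncentrality(means, sizes, sigma):
    """Noncentrality parameter of the one-way ANOVA F test."""
    n_total = sum(sizes)
    weighted_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    return sum(n * (m - weighted_mean) ** 2 for n, m in zip(sizes, means)) / sigma ** 2

# Hypothetical planning values: three population means and a common standard deviation
means = (10.0, 12.0, 14.0)

baseline = noncentrality(means, (5, 5, 5), sigma=2.0)
more_data = noncentrality(means, (10, 10, 10), sigma=2.0)    # doubling n doubles it
more_noise = noncentrality(means, (5, 5, 5), sigma=4.0)      # doubling sigma quarters it

print(baseline, more_data, more_noise)
```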

Special problems with unbalanced sample sizes:
The one-way ANOVA test is not too sensitive to inequality of variances if the sample sizes are equal. If the sample sizes are not approximately equal, the pooled variance estimate used in the F statistic is dominated by the sample variances of the larger samples. As a result, the test is less likely to correctly identify significant differences in the means if the larger samples are associated with the larger population variances, and more likely to report nonexistent differences in the means if the smaller samples are associated with the larger population variances. Unbalanced sample sizes also increase any effect due to nonnormality, and require adjustments to be made in calculating multiple comparisons tests.

Multiple comparisons:
In general, the multiple comparisons tests will be robust in those situations when the one-way ANOVA's F test is robust, and will be subject to the same potential problems with unequal variances, particularly when the sample sizes are unequal. As with the one-way ANOVA itself, the best protection against the effects of possible assumption violations is to employ equal sample sizes. Unequal variances may make individual comparisons of means inaccurate, because the multiple comparison techniques rely on a pooled estimate for the variance, based on the assumption that the population variances are equal.

Ideally, the sample sizes will be equal for all-pairwise multiple comparison tests. When they are not, an adjustment must be made to the calculations. The Tukey-Kramer adjustment (based on the harmonic mean of each pair's sample sizes), which Prophet uses, may be conservative (that is, it may be less likely to flag means as different than the nominal significance level would suggest), but in general performs well. An alternative procedure is to use the harmonic mean of all the sample sizes for all the pairwise comparisons. This has the disadvantage that the actual significance level of the test is more often different from the nominal significance level than is the case with the Tukey-Kramer adjustment; worse, the actual significance level of the test may be greater than the nominal significance level, meaning that the test is more likely to incorrectly flag a mean difference as significant.
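The difference between the two adjustments lies in the standard error used for each pairwise comparison. The sketch below assumes the usual studentized-range form, in which the statistic for a pair is the absolute difference of the two sample means divided by this standard error. The function names and numbers are illustrative and are not Prophet's implementation.

```python
import math

def tukey_kramer_se(ms_within, n_i, n_j):
    """Standard error for one pair, based on the harmonic mean of that pair's sizes."""
    return math.sqrt(ms_within / 2.0 * (1.0 / n_i + 1.0 / n_j))

def overall_harmonic_se(ms_within, sizes):
    """Standard error shared by all pairs, based on the harmonic mean of all sizes."""
    n_harmonic = len(sizes) / sum(1.0 / n for n in sizes)
    return math.sqrt(ms_within / n_harmonic)

# Hypothetical within-group mean square and unbalanced sample sizes
ms_within = 4.0
sizes = (5, 20, 20)

print(tukey_kramer_se(ms_within, 5, 20))        # pair that includes the small sample
print(tukey_kramer_se(ms_within, 20, 20))       # pair of large samples
print(overall_harmonic_se(ms_within, sizes))    # one value used for every pair
```

In this example the overall harmonic mean gives a smaller standard error than Tukey-Kramer for the pair that includes the small sample, which makes that comparison easier to flag as significant. With equal sample sizes the two calculations agree.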



Last modified: March 17, 1997

©1996 BBN Corporation All rights reserved.