The exact assumptions and null hypothesis for the chi-square test for independence depend on the sampling scheme used, although the calculated statistic is the same in each case. There are three possible sampling schemes for the values in a contingency table with **R** rows and **C** columns:

**Sampling Scheme 1: The total number of data values in the contingency table (N) is fixed, but none of the row or column totals are fixed.** This sampling scheme is known as **cross-sectional**, **naturalistic**, or **multinomial** sampling. In this case, the assumptions are:

- The data observations are made on a random sample of **N** objects, cross-classified according to two attributes, the row variable and the column variable.
- The sampled values are independent.
- Each object is classified into one and only one category of the row variable, and into one and only one category of the column variable.
- The event of an observation being in a particular row is independent of that same observation being in a particular column.
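Because the calculated statistic is the same under every sampling scheme, it can help to see how it is built from the table. The following is a minimal sketch in plain Python, assuming the usual Pearson form (squared difference between observed and expected counts, divided by the expected count, summed over cells). The function names and the table of counts are hypothetical and chosen only for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an R x C table of counts."""
    expected = expected_counts(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 3 table of counts (made up for illustration).
observed = [[10, 20, 30],
            [20, 20, 20]]
stat, dof = pearson_chi_square(observed)
print(expected_counts(observed))  # expected counts under independence
print(stat, dof)                  # statistic and its degrees of freedom
```

The sketch stops at the statistic and its degrees of freedom; converting these to a p-value requires the chi-square distribution, which is left to a statistics package.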

**Sampling Scheme 2: The total number of data values in the contingency table (N) is fixed, and either the row marginal totals or the column marginal totals are fixed.** If one of the attributes is viewed as an outcome variable and the other as an explanatory variable (e.g., if one variable is the occupation of the parent and the other is the occupation of the child), then the study is a **retrospective** or **case-control** study if the marginal totals are fixed for the outcome variable, and a **prospective** study if the marginal totals are fixed for the explanatory variable. If the **R** row marginal totals are fixed such that row **i** has **n[i]** observations in it, the assumptions are:

- The data observations are made on **R** random samples, with **n[i]** values in the **i**th sample.
- Sample **i** is taken from objects that have the **i**th value of the row attribute.
- Within each sample, the values are independent.
- The **R** samples are independent.
- Each object is classified into one and only one category of the column variable.
- For any given column, the probability of an observation being in that column is the same for all rows.
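When the row totals are fixed, each row is its own sample, and the hypothesis being tested is that every row shares the same set of column probabilities. A minimal sketch of that comparison, assuming a hypothetical table in which two samples of fixed size (50 and 100) are classified into three column categories:

```python
def row_proportions(table):
    """Within each row (each fixed-size sample), the share of counts per column."""
    return [[count / sum(row) for count in row] for row in table]


def pooled_proportions(table):
    """Column shares over all rows combined: the common estimate under the null."""
    n = sum(sum(row) for row in table)
    return [sum(col) / n for col in zip(*table)]


# Hypothetical counts: rows are the fixed-size samples, columns are the categories.
observed = [[10, 15, 25],
            [20, 30, 50]]
props = row_proportions(observed)
pooled = pooled_proportions(observed)
print(props)   # per-row column proportions
print(pooled)  # pooled column proportions
```

In this made-up table the two rows have identical proportions, so the observed counts match the expected counts exactly. Real data will differ from row to row, and the chi-square statistic measures how far.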

**Sampling Scheme 3: The total number of data values in the contingency table (N) is fixed, and both the row marginal totals and the column marginal totals are fixed.** This is also the sampling scheme assumed by Fisher's exact test. If the row marginal totals and the column marginal totals are fixed, the assumptions are:

- Each object is classified into one and only one category of the row variable, and into one and only one category of the column variable.
- The **N** observations come from a random sample such that each observation has the same probability of being classified into the **i**th row and the **j**th column as any other observation.
- The event of an observation being in a particular row is independent of that same observation being in a particular column.
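One way to picture a scheme in which both sets of marginal totals are fixed is to keep every observation's row label and column label but re-pair them at random. Every table produced this way has the same row and column totals as the original. The sketch below illustrates that idea only; it is not a procedure described in this guide, and the helper names and counts are hypothetical.

```python
import random


def expand(table):
    """Turn a table of counts into one (row, column) label pair per observation."""
    rows, cols = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            rows.extend([i] * count)
            cols.extend([j] * count)
    return rows, cols


def tabulate(rows, cols, n_rows, n_cols):
    """Count label pairs back into an n_rows x n_cols table."""
    table = [[0] * n_cols for _ in range(n_rows)]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table


def shuffled_table(table, rng):
    """Randomly re-pair row and column labels; both sets of margins are unchanged."""
    rows, cols = expand(table)
    rng.shuffle(cols)
    return tabulate(rows, cols, len(table), len(table[0]))


# Hypothetical 2 x 2 table of counts (made up for illustration).
observed = [[8, 2],
            [1, 9]]
rng = random.Random(0)  # seeded only so the sketch is repeatable
resampled = shuffled_table(observed, rng)
print(resampled)
```

Repeating the shuffle many times and recomputing a statistic on each table would give a simulated reference distribution conditional on both margins.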

The chi-square test uses the chi-square distribution to approximate the underlying exact distribution. Although the chi-square approximation can be used in all three sampling schemes, it becomes less accurate when marginal totals are fixed; the best approximation is most likely to occur in the first (**multinomial**) sampling scheme. The approximation improves as the expected cell frequencies grow larger, and may be inappropriate for contingency tables with very small expected cell frequencies. For a 2x2 contingency table, an adjusted value of the chi-square statistic (the **Yates corrected chi-square**) is often used to correct for a continuous distribution (chi-square) being used to approximate the discrete distribution of the values in the 2x2 table.

Fisher's exact test assumes that the total number of data values in the 2x2 contingency table (**N**) is fixed, and that both the row marginal totals and the column marginal totals are fixed. **If the 2 row marginal totals are fixed and the 2 column marginal totals are fixed, the assumptions for Fisher's exact test are:**

- Each object is classified into one and only one category of the row variable, and into one and only one category of the column variable.
- The **N** observations come from a random sample such that each observation has the same probability of being classified into the **i**th row and the **j**th column as any other observation.
- The event of an observation being in a particular row is independent of that same observation being in a particular column.
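For a single 2x2 table, the uncorrected statistic, the Yates corrected statistic, and Fisher's exact test can be compared side by side. The sketch below is one possible plain-Python rendering and rests on some assumptions that this guide does not specify. The corrected statistic subtracts N/2 from |ad - bc| and is floored at zero. The two-sided exact p-value sums the hypergeometric probabilities of all tables no more probable than the observed one, which is one common convention among several. The counts are hypothetical.

```python
from math import comb


def min_expected_count(table):
    """Smallest expected cell count under independence for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return min(r * k / n for r in (a + b, c + d) for k in (a + c, b + d))


def chi_square_2x2(table, yates=False):
    """Pearson chi-square for a 2x2 table, optionally with the Yates correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)  # assumed form of the continuity correction
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def fisher_exact_two_sided(table):
    """Two-sided exact p-value with both margins fixed (hypergeometric)."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability that the top-left cell equals x given the margins.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so ties are not lost to rounding.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical 2 x 2 table of counts (made up for illustration).
observed = [[3, 1],
            [1, 3]]
print(min_expected_count(observed))
print(chi_square_2x2(observed), chi_square_2x2(observed, yates=True))
print(fisher_exact_two_sided(observed))
```

With counts this small the corrected and uncorrected statistics differ substantially, which is the situation in which the exact test is usually preferred.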

Among measures of association for two-way contingency tables, Kendall's Tau B, Tau C, Spearman's rho, and Gamma assume that both the row and column variables have ordered categories (such as disease severity categories).
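The ordering requirement can be seen in how such a measure is computed. As an illustration, Gamma (Goodman and Kruskal's gamma) compares the number of concordant pairs of observations, where one observation is higher on both variables, with the number of discordant pairs, which makes sense only if "higher" is defined for both the row and column categories. A minimal sketch with hypothetical counts, assuming rows and columns are already listed in increasing order:

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in a table with ordered categories."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # only rows strictly below row i
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if m > j:
                        concordant += pairs     # higher row and higher column
                    elif m < j:
                        discordant += pairs     # higher row but lower column
    return concordant, discordant


def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D); undefined when there are no untied pairs."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


# Hypothetical 2 x 2 table with ordered categories (made up for illustration).
observed = [[10, 5],
            [5, 10]]
print(concordant_discordant(observed))
print(goodman_kruskal_gamma(observed))
```

Reordering the categories of either variable changes the pair counts, and therefore the value of the measure, which is why it is not meaningful for unordered categories.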

Cross-classification schemes for two-way contingency tables work best when the categories for both variables are discrete (e.g., gender). When a continuous variable such as age is divided into intervals to form the categories of a variable, the interval boundaries should be decided beforehand on the basis of theory or custom. The intervals should not be determined by the particular data being analyzed.
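In practice, fixing the boundaries beforehand means writing them down as constants before looking at the data, and then classifying every observation against those constants. A minimal sketch, in which the boundaries, labels, and ages are all hypothetical:

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical boundaries chosen in advance, not derived from the data.
AGE_BOUNDARIES = [18, 40, 65]
AGE_LABELS = ["<18", "18-39", "40-64", "65+"]


def age_category(age):
    """Map an age to its interval; a boundary value belongs to the upper interval."""
    return AGE_LABELS[bisect_right(AGE_BOUNDARIES, age)]


# Made-up ages for illustration.
ages = [12, 18, 39, 40, 64, 65, 80]
counts = Counter(age_category(a) for a in ages)
print(counts)
```

Choosing cut points from the sample itself, such as its quantiles or whichever split gives the largest statistic, would make the intervals depend on the data being analyzed, which is what the guidance above warns against.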

- **Ways to detect** before performing a contingency table analysis whether your data violate any assumptions.
- **Ways to examine** contingency table analysis results to detect assumption violations.
- **Possible alternatives** if your data or contingency table analysis results indicate assumption violations.

To properly analyze and interpret the results of a contingency table analysis, you should be familiar with the following terms and concepts:

- contingency table
- measures of association
- test of independence
- Pearson's chi-square test
- expected cell frequencies
- (in)appropriate use of the chi-square test
- Fisher's exact test



©1996 BBN Corporation All rights reserved.