- alternative hypothesis:
- The null hypothesis for
a statistical test is the assumption that the test uses for
calculating the probability of observing a result at least
as extreme as the one that occurs in the data at hand.
An alternative hypothesis is one that specifies
that the null hypothesis is not true.
For
the one-sample t test,
the null hypothesis is that the
population mean equals a specific value.
For a two-sided test, the alternative hypothesis
is that the mean does not equal that value. It is also
possible to have a one-sided test with the
alternative hypothesis that the mean is greater than
the specified value, if it is theoretically impossible
for the mean to be less than the specified value.
One could alternatively perform a one-sided test with
the alternative hypothesis that the mean is less than
the specified value, if it were theoretically impossible
for the mean to be greater than the specified value.
One-sided tests usually have more
power than two-sided tests, but they require
more stringent assumptions. They should only be
used when those assumptions (such as the mean always
being at least as large as the specified value for
the one-sample t test) apply.
- between effects:
-
In a repeated measures ANOVA, there will be
at least one factor that is measured at each level for every subject.
This is a within (repeated measures) factor.
For example, in an experiment in which each subject performs the same
task twice, trial (or trial number) is a within factor.
There may also be one or more factors that are measured at only
one level for each subject, such as gender. This type of factor
is a between or grouping factor.
- bias:
-
An estimator for a parameter is unbiased
if its expected value is the true value of the
parameter. Otherwise, the estimator is biased.
- binary variable:
-
A binary random variable is a discrete
random variable that has only two possible values, such as whether
a subject dies (event) or lives (non-event).
Such events are often described as success vs failure.
- boxplot:
- A boxplot is a graph summarizing the
distribution
of a set of data values. The upper and lower ends
of the center box indicate the 75th and 25th percentiles
of the data, the center line indicates the median,
and the center + indicates the mean.
Suspected
outliers
appear in a boxplot as individual points o
or x outside the box. The o outlier values are known
as outside values, and the x outlier values as
far outside values.
If the difference (distance) between
the 75th and 25th percentiles of the data is H,
then the outside values are those values that are
more than 1.5H but no more than 3H above the upper quartile,
and those values that are
more than 1.5H but no more than 3H below the lower quartile.
The far outside values are values that are more than
3H above the upper quartile or more than 3H below the lower quartile.
Examples of these plots
illustrate various situations.
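The 1.5H and 3H rules above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_outliers` and the data are made up, and the quartiles use the "inclusive" method from the standard library, which is one of several conventions that statistical packages use for percentiles.

```python
from statistics import quantiles

def classify_outliers(values):
    """Split values into outside and far outside values using the 1.5H / 3H rules."""
    # Quartile conventions differ between packages; "inclusive" is an assumption here.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    h = q3 - q1  # H: distance between the 75th and 25th percentiles
    outside, far_outside = [], []
    for v in values:
        # Distance beyond the nearer quartile (0 for values inside the box)
        excess = max(v - q3, q1 - v, 0.0)
        if excess > 3 * h:
            far_outside.append(v)   # would be drawn as x
        elif excess > 1.5 * h:
            outside.append(v)       # would be drawn as o
    return q1, q3, outside, far_outside

# Hypothetical data values
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]
q1, q3, outside, far_outside = classify_outliers(data)
print(q1, q3, outside, far_outside)
```

With these values the quartiles are 3.5 and 8.5, so H is 5; the value 17 lies more than 1.5H above the upper quartile and 30 lies more than 3H above it.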
- cell:
-
In a multi-factor ANOVA
or in a contingency table,
a cell is an individual combination of possible
levels (values)
of the factors. For example,
if there are two factors, gender with values male
and female and risk with values low,
medium, and high, then there are 6 cells:
males with low risk, males with medium risk, males with
high risk,
females with low risk, females with medium risk, and females with
high risk.
- censoring:
-
In an experiment in which subjects are followed over time
until an event of interest (such as death or other type of failure)
occurs, it is not always possible to follow
every subject until the event is observed. Subjects may drop out
of the study and be lost to follow-up, or be deliberately
withdrawn, or the end of the data collection period may
arrive before the event is observed to happen. For
such a subject, all that is known is that the time to
the event was at least as long as the time to when
the subject was last observed. The observed time to the event
under such circumstances is censored.
Survival analysis
methods generally allow for censored data.
Censoring may occur from the right (observation
stops before the event is observed), as in
censorship for survival analysis, or from
the left (observation does not begin until
after the event has occurred).
- central tendency:
- The generalized concept of the "average" value of
a distribution. Typical
measures of central tendency are
the mean, the median, the mode, and the geometric mean.
- centroid:
-
The centroid of a set of multi-dimensional data points is
the point whose coordinate in each dimension is the mean of the
data values in that dimension (it need not be one of the observed
points). For X-Y data, the centroid is the
point at (mean of the X values, mean of the Y values).
A simple linear regression line always passes through
the centroid of the X-Y data.
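As a small check of the claim that a simple linear regression line passes through the centroid, the following Python sketch fits a least squares line to made-up X-Y data and evaluates it at the mean of the X values. The helper names `centroid` and `fit_line` are illustrative only.

```python
from statistics import mean

def centroid(xs, ys):
    """Return (mean of the X values, mean of the Y values)."""
    return mean(xs), mean(ys)

def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 for fitted Y = b0 + b1*X."""
    x_bar, y_bar = centroid(xs, ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical X-Y data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

x_bar, y_bar = centroid(xs, ys)
b0, b1 = fit_line(xs, ys)
fitted_at_centroid = b0 + b1 * x_bar  # equals y_bar up to rounding error
print(x_bar, y_bar, fitted_at_centroid)
```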
- chi-square test for goodness of fit:
-
The chi-square test for
goodness of fit tests the hypothesis that the
distribution
of the population
from which nominal data are drawn
agrees with a posited distribution.
The chi-square goodness-of-fit test compares observed and
expected frequencies
(counts). The chi-square test statistic is basically the
sum of the squares of the differences between the observed
and expected frequencies, with each squared difference
divided by the corresponding expected frequency.
- chi-square test for independence (Pearson's):
-
Pearson's
chi-square test for independence
for a contingency table
tests the null hypothesis
that the row classification factor
and the column classification factor
are independent.
Like the chi-square goodness-of-fit test, the chi-square
test for independence compares observed and
expected frequencies
(counts). The expected frequencies are calculated
by assuming the null hypothesis is true.
The chi-square test statistic is basically the
sum of the squares of the differences between the observed
and expected frequencies, with each squared difference
divided by the corresponding expected frequency.
Note that the chi-square statistic is always calculated
using the counted frequencies. It cannot
be calculated using the observed proportions, unless the
total number of subjects (and thus the frequencies) is
also known.
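The calculation can be sketched in Python as follows. The counts are hypothetical and the function name `chi_square_independence` is illustrative. The expected frequency for each cell is taken as (row total x column total) / grand total, which is what independence of rows and columns implies. The second call shows why proportions alone are not enough: multiplying every count by 10 leaves the proportions unchanged but multiplies the statistic by 10.

```python
def chi_square_independence(table):
    """Return the Pearson chi-square statistic and the expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts assuming the row and column factors are independent
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected

# Hypothetical 2x2 table of counts
observed = [[10, 20], [30, 40]]
statistic, expected = chi_square_independence(observed)

# Same proportions, ten times as many subjects
scaled = [[10 * count for count in row] for row in observed]
scaled_statistic, _ = chi_square_independence(scaled)

print(expected, statistic, scaled_statistic)
```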
- conservative:
-
A hypothesis test is conservative if the actual significance level
for the test is smaller than the stated significance level of the test.
An example is the Kolmogorov-Smirnov distribution test,
which becomes conservative when the parameters of the distribution are
estimated from the data instead of being specified in advance.
A conservative test may incorrectly fail to reject the
null hypothesis, and thus is
less powerful than was expected.
- consistent:
-
A hypothesis test is consistent for a specified
alternative hypothesis
if the power of the test for the alternative
hypothesis approaches 1 as the sample size becomes infinitely large.
- contaminated normal distribution:
- A contaminated normal distribution is a type of
mixture distribution for
which observed values can come from one of multiple
normal distributions.
For example, in taking measurements of blood pressure from a population,
the distribution for males may be a normal distribution,
the distribution for females may also be a normal distribution, but if
the two normal distributions do not have the same mean and variance,
then the composite distribution is not normal.
A common type of contaminated normal distribution is a composite of
two normal distributions with the same mean, but with different
variances, such that only a minority of the values come from
the distribution with the larger variance. Such a distribution
is heavy-tailed relative to the
normal distribution.
If the proportion of values from the distribution with the larger
variance is small enough, the contaminated normal distribution
may look like a normal distribution with outliers. In such a
situation, one should be alert to the possibility of a connection
or common trait among the outlying values that might suggest
that all come from a second distribution with a different variance.
- contingency table:
-
If individual values are cross-classified by levels in two different
attributes (factors), such as gender and
tumor vs no tumor, then a contingency table is the tabulated
counts for each combination of levels of the two factors, with
the levels of one factor labeling the rows of the table, and
the levels of the other factor labeling the columns of the table.
For the factors gender and presence of tumor, each with two levels, we would
get a 2x2 contingency table, with rows Female and Male, and
columns Tumor and No Tumor.
The counts for each cell
in the
table would be the number of subjects with the corresponding
row level of gender and column level of tumor vs no tumor:
females with tumors in row 1, column 1; females without tumors in row 1,
column 2; males with tumors in row 2, column 1; and
males without tumors in row 2, column 2.
Contingency tables
are also known as cross-tabulations. The most common method
of analyzing such tables statistically
is to perform a (Pearson) chi-square test for independence
or Fisher's exact test.
- correlation:
-
Correlation is the linear association
between two random variables X and Y. It is usually
measured by a correlation coefficient, such
as Pearson's r, such that the value of
the coefficient ranges from -1 to 1.
A positive value of r means that the association
is positive; i.e., that if X increases, the
value of Y tends to increase linearly, and if X decreases,
the value of Y tends to decrease linearly.
A negative value of r means that the association
is negative; i.e., that if X increases, the
value of Y tends to decrease linearly, and if X decreases,
the value of Y tends to increase linearly. The larger
r is in absolute value, the stronger the
linear association between X and Y. If r
is 0, X and Y are said to be uncorrelated,
with no linear association between X and Y.
Independent
variables are always uncorrelated, but
uncorrelated variables need not be independent.
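A short Python sketch may make the last point concrete. It computes Pearson's r for two made-up data sets: one where Y is an exact increasing linear function of X, and one where Y is completely determined by X (Y is X squared) yet r is 0 because the relationship is not linear. The function name `pearson_r` is illustrative, and the example works with sample values rather than with random variables.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]

# Exact positive linear association: r is 1
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])

# Y depends entirely on X, but not linearly: r is 0
r_square = pearson_r(xs, [x * x for x in xs])

print(r_linear, r_square)
```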
- covariate:
-
A covariate is a variable that may affect the relationship between
two variables of interest, but is not of intrinsic interest itself.
As in blocking or
stratification, a covariate
is often used to control for variation that is not attributable
to the variables under study. A covariate may be a discrete
factor, like a block effect, or
it may be a continuous variable, like the X variable in
an analysis of covariance.
Note that some people use the term covariate to include
all the variables that may affect the response
variable, including both the primary (predictor) variables,
and the secondary variables we call covariates.
- curvilinear functions:
-
A curvilinear function is one whose value, when plotted, will follow
a continuous but not necessarily straight line, such as a polynomial,
logistic, exponential, or sinusoidal curve.
- death density function:
-
The death density function is a time to failure
function that gives the instantaneous probability
of the event (failure).
That is, in a survival experiment where the event is death,
the value of the density function at time T is the
probability that a subject will die precisely at time T.
This differs from the hazard function,
which gives the probability
conditional on a subject having survived to time T.
The death density function is always
nonnegative (greater than or equal to 0),
and a peak in the function indicates a time
at which the probability of failure is high.
Other names for the death density function are
probability density function and
unconditional failure rate.
Related functions are the
hazard function,
the conditional instantaneous probability of the event (failure)
given survival up to that time;
and the survival function, which
represents the probability that the event (failure) has not yet occurred.
The cumulative hazard function is the integral over time
of the hazard function, and is estimated as the negative logarithm
of the survival function.
- distribution function:
-
A distribution function (also known as the probability distribution
function) of a continuous random variable X is a mathematical
relation that gives for each number x, the probability that
the value of X is less than or equal to x. For example,
a distribution function of height gives,
for each possible value of height, the probability that
the height is less than or equal to that value.
For discrete random variables, the distribution function
is often given as the probability associated with
each possible discrete value of the random variable;
for instance, the distribution function for a fair
coin is that the probability of heads is 0.5 and
the probability of tails is 0.5.
- distribution-free tests:
-
Distribution-free tests are tests whose validity
under the null hypothesis does not require a specification of
the population
distribution(s)
from which the data have been
sampled.
- expected cell
frequencies:
-
For nominal (categorical) data in which the count of items
in each category has been tabulated, the observed frequency
is the actual count, and the expected frequency is the count
predicted by the theoretical distribution
underlying the data. For example, if the hypothesis is that a certain
plant has yellow flowers 3/4 of the time and white
flowers 1/4 of the time, then for 100 plants, the
expected frequencies will be 75 for yellow and 25 for
white. The observed frequencies will be the actual
counts for 100 plants (say, 73 and 27).
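The flower example can be worked through in Python. The sketch below computes the expected frequencies from the hypothesized proportions and then the chi-square goodness-of-fit statistic (the sum of squared differences between observed and expected counts, each divided by the expected count). The variable names are illustrative.

```python
# Hypothesized proportions and observed counts from the flower example
hypothesized = {"yellow": 3 / 4, "white": 1 / 4}
observed = {"yellow": 73, "white": 27}

n = sum(observed.values())  # 100 plants

# Expected frequency = total count x hypothesized proportion
expected = {color: n * p for color, p in hypothesized.items()}

# Chi-square goodness-of-fit statistic
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)

print(expected, chi_square)
```

Here the statistic is 4/75 + 4/25, or about 0.213, reflecting how close the observed counts are to the expected ones.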
- factors:
-
A factor is a single discrete classification scheme for data, such that
each item classified belongs to exactly one class
(level)
for that classification scheme. For example, in a drug experiment involving
rats, sex (with levels male and female) or
drug received could be factors.
A one-way analysis of variance
involves a single factor classifying
the subjects (e.g., drug received);
multi-factor analysis of variance
involves multiple factors classifying the subjects
(e.g., sex and drug received).
- fixed effects:
-
In an experiment using a fixed-effect design, the results of the experiment
apply only to the populations included in the experiment.
Those populations include all (or at least most of) those of interest.
This is true for many experiments, where the effects are due to
such variables as gender, age categories, disease states, or treatments.
When the populations included in the experiment are a random subset
of those of interest, then the experiment follows a
random-effects design.
Multiple comparisons tests
for an analysis of variance may be applied when the effects
are fixed. They are not appropriate if the effects are random.
Whether an effect is considered random or fixed
may depend on the circumstances. A factory may conduct an experiment
comparing the output of several machines. If those machines are the
only ones of interest (because they constitute the entire set of
machines owned by that company), then machine will be a fixed effect.
If the machines were instead selected randomly from among those
owned by the company, then machine would be a random effect.
- Fisher's exact test:
-
Fisher's exact test for a 2x2
contingency table
is a test of the null hypothesis
that the row classification factor
and the column classification factor
are independent.
Fisher's exact test consists of calculating the
actual (hypergeometric) probability of
the observed 2x2 contingency table
with respect to all other possible 2x2 contingency tables
with the same column and row totals. The probabilities of
all such tables that are each no more likely than the
observed table are calculated. The sum of these
probabilities is the P value. If the sum is less than or equal to the specified
significance level,
then the null hypothesis is rejected.
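The procedure can be sketched in Python using only the standard library. The function name `fisher_exact_p` and the example counts are hypothetical. The small tolerance used when comparing probabilities is an implementation choice to guard against floating-point ties, not part of the definition.

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the fixed row and column totals
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = table_probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    probabilities = [table_probability(x) for x in range(smallest, largest + 1)]
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in probabilities if p <= p_observed * (1 + 1e-9))

# Hypothetical table: rows are two groups, columns are two outcomes
p_value = fisher_exact_p(3, 1, 1, 3)
print(p_value)
```

For this table the possible top-left counts 0 through 4 have probabilities 1/70, 16/70, 36/70, 16/70, and 1/70, so the P value is 34/70, or about 0.486.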
- goodness of fit:
-
Goodness-of-fit tests
test the conformity of the observed data's
empirical distribution function
with a posited theoretical
distribution function.
The chi-square goodness-of-fit test
does this by comparing
observed and expected frequency counts. The
Kolmogorov-Smirnov test
does this by calculating the maximum vertical distance between
the empirical and posited distribution functions.
- hazard function:
-
The hazard function is a time to failure
function that gives the instantaneous probability
of the event (failure) given that it has not yet occurred.
That is, in a survival experiment where the event is death,
the value of the hazard function at time T is the
probability that a subject will die precisely at time T,
given that the subject has survived to time T.
The function may increase with time, meaning that
the longer subjects survive, the more likely
it becomes that they will die shortly (as for
cancer patients who do not respond to treatment).
It may decrease with time, meaning that the longer
subjects survive, the more likely it is that
they will survive into the near future (as
for post-operative survival for gunshot victims).
It may remain constant, as for a population
with a (negative) exponential survival distribution.
Or it may have a more complicated shape, like the
well-known "bathtub" curve for human mortality, where
the hazard is high for newborns, drops quickly,
stays low through adulthood, and then rises again
in old age.
Other names for the hazard function are
instantaneous failure rate,
force of mortality,
conditional mortality rate, and
age-specific failure rate.
Related functions are the
death density function,
the unconditional instantaneous probability of the event (failure);
and the survival function, which
represents the probability that the event (failure) has not yet occurred.
The cumulative hazard function is the integral over time
of the hazard function, and is estimated as the negative logarithm
of the survival function.
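The relationships among these functions can be illustrated for the exponential survival distribution mentioned above. The sketch assumes a hypothetical rate of 0.2 per unit time and uses the standard identity that the hazard is the death density divided by the survival function, which follows from the hazard being conditional on survival to that time.

```python
from math import exp, log

RATE = 0.2  # hypothetical failure rate per unit time

def survival(t):
    """Probability that the event has not yet occurred by time t."""
    return exp(-RATE * t)

def death_density(t):
    """Unconditional density of failure at time t."""
    return RATE * exp(-RATE * t)

def hazard(t):
    """Density of failure at time t, conditional on survival to time t."""
    return death_density(t) / survival(t)

def cumulative_hazard(t):
    """Negative logarithm of the survival function."""
    return -log(survival(t))

times = [0.5, 1.0, 5.0, 10.0]
hazards = [hazard(t) for t in times]  # constant, equal to RATE
print(hazards, cumulative_hazard(5.0))
```

The hazard is the same at every time, and the cumulative hazard at time 5 equals the rate times 5, the integral of a constant hazard.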
- heavy-tailed:
- A heavy-tailed distribution
is one in which
the extreme portion of the distribution
(the part farthest away from the median)
spreads out further relative to the width
of the center (middle 50%) of the distribution than is the case
for the normal distribution.
For a symmetric heavy-tailed distribution like the Cauchy
distribution, the probability of observing a value
far from the median in either direction is greater
than it would be for the normal distribution.
Boxplots may help in detecting
heavy-tailedness;
normal probability plots may also help in detecting
heavy-tailedness.
- histogram:
-
A histogram is a graph of grouped (binned) data in which the number of
values in each bin is represented by the area of a rectangular box.
- homoscedasticity (homogeneity of variance):
- Normal-theory-based tests for the equality of
population means such as
the t test and analysis of variance, assume that the data come from
populations
that have the same variance, even if the test rejects the
null hypothesis of equality of population means.
If this assumption of homogeneity of variance is not met,
the statistical test results may not be valid.
Heteroscedasticity refers to lack of homogeneity of variances.
- (in)appropriate use of
chi-square test:
-
Pearson's chi-square test
for independence for a contingency table
involves using a normal approximation to the actual
distribution
of the frequencies in the contingency table. This approximation
becomes less reliable when the
expected frequencies
for the contingency table are very small.
A standard (and conservative) rule of thumb (due to Cochran) is to avoid using
the chi-square test for contingency tables with expected
cell frequencies less than 1, or when more than 20% of
the contingency table cells have expected cell frequencies
less than 5.
In such cases, an alternate test like Fisher's exact test
for a 2x2 contingency table should be considered for a
more accurate evaluation of the data.
- independent:
- Two random variables are independent if their joint
probability density is the product of their individual
(marginal) probability densities. Less technically,
if two random variables A and B are independent, then
the probability of any given value of A is unchanged
by knowledge of the value of B. A
sample
of mutually independent random variables
is an independent sample.
- index plot:
- An index plot of data values is a plot of each value (Y) against
its order in the data set (X). If data are entered into a table in the
order in which they are collected, for example, then a plot of data value against
row number will produce an index plot. An index plot may help detect
correlation between successive data values,
a sign of lack of independence.
- interaction:
-
In multi-factor analysis of variance,
factors A and B interact
if the effect of factor A is
not independent of the level of factor B.
For example, in a drug experiment involving
rats, there would be an interaction between the factors
sex and treatment
if the effect of treatment was not the same for males and females.
- kurtosis:
- Kurtosis is a measure of the heaviness of the tails in a
distribution, relative to the
normal distribution.
A distribution with negative kurtosis (such as the uniform distribution)
is light-tailed relative to the
normal distribution, while
a distribution with positive kurtosis (such as the double exponential (Laplace) distribution)
is heavy-tailed relative to the
normal distribution.
- levels within
factors:
-
When a factor is used to classify
subjects, each subject is assigned to one class value;
e.g., male or female for the factor sex or the specific treatment
given for the factor treatment. These individual class values
within a factor are called levels. Each subject is assigned
to exactly one level for each factor.
Each unique combination of levels for each factor is a cell.
- leverage:
- Leverage is a measure of the amount of influence a given
data value has on a
fitted linear regression.
For a change in an observed Y value, the
leverage is the proportional change in the fitted Y value.
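For simple linear regression the leverage of the i-th point has a standard closed form, 1/n + (x_i - mean of X)^2 / Sxx, where Sxx is the sum of squared deviations of the X values from their mean. The Python sketch below uses that formula on made-up data and checks that changing one observed Y value by some amount changes its fitted value by the leverage times that amount. The function names are illustrative.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 for fitted Y = b0 + b1*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1

def leverages(xs):
    """Leverage of each point in a simple linear regression."""
    x_bar = mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / len(xs) + (x - x_bar) ** 2 / sxx for x in xs]

# Hypothetical data; the last X value is far from the others
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [1.2, 1.9, 3.2, 3.8, 9.5]
h = leverages(xs)

# Change the last observed Y by delta and refit
i, delta = 4, 1.0
b0, b1 = fit_line(xs, ys)
ys_changed = ys[:]
ys_changed[i] += delta
c0, c1 = fit_line(xs, ys_changed)
change_in_fit = (c0 + c1 * xs[i]) - (b0 + b1 * xs[i])

print(h[i], change_in_fit)
```

The last point has leverage 0.92, so nearly all of a change in its observed Y value carries over to its fitted value.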
- life table method:
-
For survival studies,
life tables
are constructed by partitioning time into intervals
(usually equal intervals), and then counting for each time interval:
the number of subjects alive at the start of the interval,
the number who die during the interval, and the number
who are lost to follow-up or withdrawn during the interval.
Those lost or withdrawn are censored.
Those alive at the end of a time interval were at risk for
the entire interval. Under the usual actuarial method
of survival function
estimation for life tables,
the estimate of the probability of survival
within each time interval is calculated by
assuming that any values censored in that interval
were at risk for half the interval.
Death can be replaced by any other identifiable
event.
Unlike the Kaplan-Meier product-limit method,
the life table survival estimate can still be
calculated even if the exact survival or censoring
times are not known for each individual, as long
as the number of individuals who die or
are censored within each time interval is known.
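A minimal Python sketch of the actuarial calculation follows, using a hypothetical three-interval life table. Treating censored subjects as at risk for half the interval is implemented here by subtracting half the number censored from the number alive at the start of the interval. The function name `actuarial_survival` is illustrative.

```python
def actuarial_survival(intervals):
    """Cumulative survival estimate at the end of each interval.

    intervals: list of (alive_at_start, deaths, censored) tuples.
    """
    estimates = []
    cumulative = 1.0
    for alive, deaths, censored in intervals:
        # Censored subjects count as at risk for half the interval
        effective_at_risk = alive - censored / 2
        # Probability of surviving this interval, given alive at its start
        cumulative *= 1 - deaths / effective_at_risk
        estimates.append(cumulative)
    return estimates

# Hypothetical life table: (alive at start, deaths, censored)
life_table = [(100, 10, 10), (80, 8, 4), (68, 17, 0)]
estimates = actuarial_survival(life_table)
print(estimates)
```

Only the counts per interval are needed, not the individual survival or censoring times.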
- light-tailed:
- A light-tailed distribution
is one in which
the extreme portion of the distribution
(the part farthest away from the median)
spreads out less far relative
to the width of the center (middle 50%) of the distribution
than is the case for the
normal distribution.
For a symmetric light-tailed distribution like the uniform
distribution, the probability of observing a value
far from the median in either direction is smaller
than it would be for the normal distribution.
Boxplots may help in detecting
light-tailedness;
normal probability plots may also help in detecting
light-tailedness.
- linear functions:
-
A linear function of one or more X variables is
a linear combination of the values of the
variables:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
An X variable in the equation could be a curvilinear function of
an observed variable (e.g., one might measure distance,
but think of distance squared as an X variable in the model,
or X2 might be the square of X1),
as long as the overall function (Y) remains a
sum of terms that are each an X variable multiplied
by a coefficient (i.e., the function Y is linear in
the coefficients).
Sometimes, an apparently nonlinear function can be
made linear by a transformation of Y,
such as the function
Y = exp(b0 + b1*X1),
which
can be made a linear function by taking the logarithm of Y
(log(Y) = b0 + b1*X1),
and then considering
log(Y) to be the overall function.
- linear logistic model:
-
A linear logistic model assumes that for each possible set of values
for the independent (X) variables, there is a probability p
that an event (success) occurs.
Then the model is that Y is
a linear combination of the values of the X
variables:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk,
where Y is the logit transformation of
the probability p.
- linear regression:
-
In a linear regression,
the fitted (predicted) value of the response
variable Y is a linear combination of the values of one or
more predictor (X) variables:
fitted Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
An X variable in the model equation could be a nonlinear function of
an observed variable (e.g., one might observe distance, but
use distance squared as an X variable in the model,
or X2 might be the square of X1),
as long as the fitted Y remains a sum of terms that
are each an X variable multiplied by a coefficient.
The most basic linear regression
model is simple linear regression, which involves
one X variable:
fitted Y = b0 + b1*X.
Multiple linear regression
refers to a linear regression with more than one X variable.
- location:
- The generalized concept of the "average" value of
a distribution.
Typical measures of location are
the mean, the median, the mode, and the geometric mean.
- logit transformation:
- The logit transformation Y
of a probability p of an event is the logarithm of the ratio between the
probability that the event occurs and the probability that
the event does not occur:
Y = log(p/(1-p)).
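The transformation and its inverse can be written in a few lines of Python. The inverse, p = 1/(1 + exp(-Y)), is not stated above but follows from solving the formula for p. The function names are illustrative.

```python
from math import exp, log

def logit(p):
    """Log of the odds p/(1-p), for a probability strictly between 0 and 1."""
    return log(p / (1 - p))

def inverse_logit(y):
    """Recover the probability p from its logit Y."""
    return 1 / (1 + exp(-y))

even_odds = logit(0.5)                    # equal chances give a logit of 0
round_trip = inverse_logit(logit(0.8))    # returns 0.8 up to rounding error
print(even_odds, round_trip)
```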
- log-rank test:
-
In survival analysis,
a log-rank test
tests the equality of k survival functions
by creating a sequence of kx2
contingency tables
(k survival functions by event observed/event not observed at that time)
one at each (uncensored)
observed event time, and calculating a statistic
based on the observed and expected values for these
contingency tables. This test is also known as the
Mantel-Cox (Mantel-Haenszel) test. The Tarone-Ware
and Gehan-Breslow tests are weighted variants
of the log-rank test; the Peto and Peto log-rank test involves
a different generalization of this log-rank scheme.
- matched samples:
- Matching, also known as pairing (with two samples) and
blocking (with multiple samples), involves matching up individuals
in the samples
so as to minimize their dissimilarity except
in the factor(s)
under study. For example, in pre-test/post-test
studies, each subject is paired (matched) with himself, so that the
difference between the pre-test and post-test responses
can be attributed to the change caused by taking the test, and
not to differences between the individuals taking the test.
A study involving animals might be blocked by matching up
animals from the same litter or from the same cage.
The goal is to
minimize the variation within the pairs or blocks
while maximizing the variation between them.
This will minimize variation between subjects
that is not attributable to the factors under study
by attributing it to the blocking factor.
The matched items in a pair or in a block are related
by their membership in that pair or block.
Other methods for controlling for variation between
subjects for variables that are not of direct
interest are stratification
and the use of covariates.
- method of maximum likelihood:
-
The method of maximum likelihood is a general method of finding
estimated (fitted) values of parameters. Estimates are
found such that the joint likelihood function, the
product of the values of the probability density (or probability) function for
each observed data value, is as large as possible.
The estimation process involves considering the
observed data values as constants and the parameter
to be estimated as a variable, and then using differentiation
to find the value of the parameter that maximizes the likelihood function.
The maximum likelihood method works best for large samples, where
it tends to produce estimators with the smallest possible variance. The
maximum likelihood estimators are often biased
in small samples.
The maximum likelihood estimates for the slope and
intercept in simple linear regression
are the same as the least squares estimates when the underlying
distribution for Y is normal. In this case, the
maximum likelihood estimators are thus unbiased.
In general, however, the maximum likelihood and least
squares estimates need not be the same.
- measures of association:
- For cross-tabulated data in a
contingency table,
a measure of association measures the degree of
association between the row and column classification
variables. Measures of association include the
coefficient of contingency,
Cramer's V, Kendall's tau-B, Kendall's tau-C,
gamma, and Spearman's rho.
- method of least squares:
-
The method of least squares is a general method of finding
estimated (fitted) values of parameters. Estimates are
found such that the sum of the squared differences
between the fitted values and the corresponding observed
values is as small as possible. In the case of
simple linear regression,
this means placing the fitted line such that the
sum of the squared vertical distances between the
observed points and the fitted line is minimized.
- median:
-
The median of a distribution is the value X such that
the probability of an observation from the distribution
being below X is the same as the probability of the
observation being above X. For a continuous distribution,
this is the same as the value X such that the probability
of an observation being less than or equal to X is 0.5.
- median remaining lifetime:
-
For survival studies using
life tables, the median
remaining lifetime for an interval of the life table
is the estimate of the additional elapsed time before
only half the individuals alive at the
beginning of the current interval are still alive.
This is also known as the median residual lifetime.
- mixed models:
-
Factors in an analysis of variance (ANOVA) may be either
fixed or
random.
Multi-factor ANOVA models in which at least one effect is fixed
and at least one effect is random are called mixed models, especially
a two-factor factorial ANOVA in which one factor is fixed and the
other is random. A randomized block ANOVA is also usually a mixed model, since the factor
of interest is usually a fixed effect while the blocking factor is usually a random effect.
For two-factor factorial ANOVA, a mixed model is also referred
to as a Type III model. (If both effects are fixed, it's a Type I model,
and if both effects are random, it's a Type II model.)
Sometimes, the term mixed model is also applied to ANOVA models
in which at least one factor is a
repeated measures (within)
factor, and at least one factor is a
grouping (between) factor.
- mixture distribution:
- A mixture distribution is a distribution for
which observed values can come from one of multiple distributions.
For example, in taking measurements of blood pressure from a population,
the distribution for males may be a normal distribution,
the distribution for females may also be a normal distribution, but if
the two normal distributions do not have the same mean and variance,
then the composite distribution is not normal.
- multicollinearity:
- In a multiple regression
with more than one X variable,
two or more X variables are collinear if they are nearly
linear combinations of each other.
Multicollinearity can make the calculations
required for the regression unstable, or
even impossible. It can also produce
unexpectedly large estimated standard errors
for the coefficients of the X variables involved.
Multicollinearity is also known as
collinearity and ill conditioning.
- multiple
comparisons:
-
An analysis of variance F test for a specific factor
tests the hypothesis that all the level means are
the same for that factor. However, if the null
hypothesis is rejected, the F test does not give
information as to which level means differ
from which other level means.
Multiplicity
issues make doing individual tests to compare
each pair of means inappropriate unless the
nominal (comparisonwise)
significance level
is adjusted to account for the number
of pairs (as in a Bonferroni method). An alternative
approach is to devise a test (such as Tukey's test)
specifically designed to keep the overall (experimentwise)
significance level at the desired value while
allowing for the comparison of all possible
pairs of means. This is a multiple comparisons test.
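The Bonferroni adjustment mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical factor with 4 levels and made-up P values for the 6 pairwise comparisons; the function name is an assumption.

```python
from math import comb


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose P value is below alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


n_levels = 4
n_pairs = comb(n_levels, 2)  # number of pairwise comparisons

# Hypothetical P values, one per pair of level means.
p_values = [0.001, 0.012, 0.030, 0.200, 0.004, 0.650]
print(n_pairs, bonferroni_reject(p_values))
```

Each comparison is tested at 0.05 / 6, so a pair with P = 0.012 is not declared different even though it would pass an unadjusted 0.05 test.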
- multiple regression:
-
Multiple regression refers to a regression model in which the
fitted value of the response variable Y is a function of the values of one or
more predictor (X) variables. The most common form of multiple regression
is multiple linear regression,
a linear regression
model with more than one X variable.
- multiplicity of
testing:
-
Even when the
null hypothesis is true, a statistical hypothesis
test has a small probability (the preselected alpha-level or
significance level)
of falsely rejecting the null hypothesis.
With a significance level of 0.05, this could be considered
as the probability of seeing 20 come up on a 20-sided fair die.
If multiple tests are done (the die is rolled multiple times),
even if the null hypothesis in each case is true,
the probability of getting at least one such false rejection
(seeing 20 turn up at least once) increases. For the common problem of
comparing pairwise mean differences
following an analysis of variance,
the probability of seeing at least one such false
rejection could approach 90% when there are 10 level means
in the factor. To avoid the multiplicity problem,
multiple comparison tests have been devised to allow for
simultaneous inference about all the pairwise comparisons
while maintaining the desired significance level.
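The 90% figure can be checked with a short calculation. The sketch below assumes, as a simplification, that the pairwise tests are independent (pairwise comparisons that share means are not strictly independent), so the result is only a rough guide.

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false rejection among n_tests
    independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


n_pairs = comb(10, 2)  # 10 level means give 45 pairwise comparisons
print(n_pairs, round(familywise_error_rate(n_pairs), 3))
```

With 45 comparisons at the 0.05 level, the probability of at least one false rejection under this assumption is about 0.90.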
- multi-sample problem:
- In the multi-sample problem, multiple independent
random samples
are collected, and then the samples are used to test a hypothesis
about the populations
from which the samples came (e.g., whether the
means of the populations are all identical).
- nonlinear functions:
-
A nonlinear function is one that is not a
linear function, and
cannot be made into a linear function by
transforming
the Y variable.
- nonlinear regression:
-
In a nonlinear regression,
the fitted (predicted) value of
the response variable is a nonlinear function
of one or more X variables.
- nonparametric
tests:
- Nonparametric tests
are tests that do not make distributional
assumptions, particularly the usual
distributional assumptions of the normal-theory based tests.
These include tests that do not involve
population
parameters at all (truly nonparametric tests
such as the chi-square goodness of fit
test), and distribution-free tests,
whose validity does not depend on
the population distribution(s) from which the data have been
sampled.
In particular, nonparametric tests usually drop the
assumption that the data come from
normally distributed
populations. However, distribution-free tests
generally do make some assumptions, such
as equality of population variances.
- normal (Gaussian)
distribution:
- The normal or Gaussian distribution is a continuous symmetric
distribution
that follows the familiar bell-shaped curve. The distribution
is uniquely determined by its mean and variance. It has been
noted empirically that many measurement variables have distributions
that are at least approximately normal. Even when a distribution
is nonnormal, the distribution of the mean of many independent
observations from the same distribution becomes arbitrarily
close to a normal distribution as the number of observations
grows large. Many frequently used statistical tests
make the assumption that the data come from a normal
distribution.
- normal probability
plot:
- A normal probability plot, also known as a
normal Q-Q plot or normal quantile-quantile plot,
is the plot of the ordered data values (as Y)
against the associated quantiles of the
normal distribution
(as X).
For data from a normal distribution, the points of the plot
should lie close to a straight line.
Examples of these plots
illustrate various situations.
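The coordinates of such a plot can be computed as sketched below. The plotting positions (i - 0.5) / n are one common convention and are an assumption here; other conventions exist. The sample values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(data):
    """Return (normal quantile, ordered data value) pairs for a Q-Q plot."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6]  # hypothetical measurements
for x, y in normal_qq_points(sample):
    print(round(x, 3), y)
```

Plotting the second coordinate (Y) against the first (X) gives the normal probability plot.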
- null hypothesis:
- The null hypothesis for a statistical test is the
assumption that the test uses for calculating the probability
of observing a result at least as extreme as the one that occurs
in the data at hand. For
the two-sample unpaired t test,
the null hypothesis is that the two
population
means are
equal, and the t test involves finding the probability
of observing a t statistic at least as extreme as the one calculated
from the data, assuming the null hypothesis is true.
- one-sample problem:
- In the one-sample problem, an independent
random sample is
collected, and then that sample is used to test a hypothesis
about the population
from which the sample came (e.g., whether the
mean of the population is 0, or any other fixed constant chosen in advance).
Paired samples are usually
reduced to a one-sample problem by replacing each pair
of responses by the difference between them (e.g.,
in a pre-test/post-test experiment, recording the
change from pre-test to post-test).
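The reduction of paired samples to a one-sample problem can be sketched as follows, using the usual one-sample t statistic on the differences. The pre-test and post-test scores are hypothetical, and the function name is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after, mu0=0.0):
    """One-sample t statistic computed on the paired differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return (mean(diffs) - mu0) / standard_error


pre = [10, 12, 9, 11, 13]    # hypothetical pre-test scores
post = [12, 13, 12, 13, 15]  # hypothetical post-test scores
print(round(paired_t_statistic(pre, post), 4))
```

The statistic would then be compared with a t distribution on n - 1 degrees of freedom.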
- order statistics:
- If the data values in a sample are sorted into increasing order,
then the ith order statistic is the ith smallest data value.
For a sample of size N, common order statistics are the extremes,
the minimum (first order statistic) and maximum (Nth order statistic).
Quantiles or percentiles such as the median are also calculated
from order statistics.
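A minimal sketch of these definitions, using a small hypothetical sample; the helper name is an assumption, and order statistics are counted from the smallest value, so that the first is the minimum.

```python
from statistics import median


def order_statistic(data, i):
    """Return the ith order statistic (1-based, counted from the smallest)."""
    return sorted(data)[i - 1]


sample = [7, 2, 9, 4, 5]  # hypothetical data, N = 5
n = len(sample)
print(order_statistic(sample, 1))  # minimum
print(order_statistic(sample, n))  # maximum
print(order_statistic(sample, 3))  # middle value, the median when N is odd
```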
- outliers:
- Outliers are anomalous values in the data.
They may be due to recording errors, which may be
correctable, or they may be due to the
sample
not being entirely from the same
population.
Apparent outliers
may also be due to the values being from the same, but
nonnormal
(in particular,
heavy-tailed), population distribution.
- P value:
- In a statistical hypothesis test, the P value is
the probability of observing a test statistic
at least as extreme as the value actually observed,
assuming that the null hypothesis
is true. This probability is then compared to the
pre-selected significance level
of the test. If the P value is smaller than the
significance level, the null hypothesis is rejected,
and the test result is termed significant.
The P value depends on both the null hypothesis and
the alternative hypothesis.
In particular, when the observed effect lies in the direction
specified by a one-sided alternative hypothesis, the one-sided test
will have a lower P value (and thus be more likely
to be significant) than the corresponding two-sided
test. However, one-sided tests require
more stringent assumptions than two-sided tests. They should only be
used when those assumptions apply.
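The dependence of the P value on the alternative hypothesis can be sketched with a test statistic that is standard normal under the null hypothesis. Using a z statistic rather than a t statistic is an assumption made to keep the example short, and the observed value 1.8 is hypothetical.

```python
from statistics import NormalDist


def z_test_p_values(z):
    """Return (one-sided, two-sided) P values for an observed z statistic.

    The one-sided alternative is that the mean is greater than the
    null value, so only the upper tail counts as extreme.
    """
    upper_tail = 1 - NormalDist().cdf(z)
    one_sided = upper_tail
    two_sided = 2 * min(upper_tail, 1 - upper_tail)
    return one_sided, two_sided


one_sided, two_sided = z_test_p_values(1.8)  # hypothetical observed statistic
print(round(one_sided, 4), round(two_sided, 4))
```

At a significance level of 0.05, this hypothetical result is significant for the one-sided test but not for the two-sided test.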
- paired samples:
- Pairing involves matching up individuals
in two samples so as to minimize their dissimilarity except
in the factor
under study. For example, in pre-test/post-test
studies, each subject is paired (matched) with himself, so that the
difference between the pre-test and post-test responses
can be attributed to the change caused by taking the test, and
not to differences between the individuals taking the test.
Such data are analyzed by examining the paired differences.
- parallelism assumption:
-
For analysis of covariance (ANCOVA),
it is assumed that
the populations
can each be correctly modeled by a straight-line
simple linear regression.
The parallelism assumption is that the regressions all
have the same slope. The assumption can be tested by
a test of equality for slopes. If the assumption of
equality of slopes does not hold, then a subsequent
test of equality of intercepts (elevations) is meaningless,
since it requires that the slopes be equal.
- pooled estimate of the
variance:
- The pooled estimate of the variance is a weighted
average of each individual
sample's
variance estimate.
When the estimates are all estimates of the same variance
(i.e., when the population
variances are equal), then
the pooled estimate is more accurate than any of
the individual estimates.
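A minimal sketch of the calculation, assuming the usual weighting in which each sample variance is weighted by its degrees of freedom (sample size minus 1). The two samples are hypothetical.

```python
from statistics import variance


def pooled_variance(samples):
    """Average of the sample variances, weighted by degrees of freedom."""
    numerator = sum((len(s) - 1) * variance(s) for s in samples)
    denominator = sum(len(s) - 1 for s in samples)
    return numerator / denominator


group_a = [1, 2, 3]     # hypothetical sample, variance 1
group_b = [2, 4, 6, 8]  # hypothetical sample, variance 20/3
print(round(pooled_variance([group_a, group_b]), 4))
```

The larger sample contributes more to the pooled estimate because it carries more degrees of freedom.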
- population:
- The population is the universe of all the objects from which
a sample could be drawn for an
experiment. If a representative random sample is chosen, the results of
the experiment should be generalizable to the population from which
the sample was drawn, but not necessarily to a larger population.
For example, the results of medical studies on males may not
be generalizable for females.
- power:
- The power of a test is the probability
of (correctly) rejecting the
null hypothesis
when it is in fact false. The power depends
on the
significance level
(alpha-level) of the test, the components of the
calculation of the test statistic,
and on the specific
alternative hypothesis
under consideration. For the
two-sample unpaired t test,
an alternative
hypothesis would be that the difference
between the two population
means was
some specific non-zero value, such as 1.5;
the components of the test statistic
include the sample sizes, sample means, and sample variances.
The greater the power of a two-sample
unpaired t test, the better able it is to
correctly reject (i.e., declare significant)
small but real differences between the
two population means. A power curve plots
the power against the actual difference
between the population means.
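Points on a power curve can be sketched as follows. This is a rough illustration that assumes a known common standard deviation and uses a normal approximation in place of the t distribution, so it is not the exact power of the t test. All parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sigma, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample test when the true
    difference between population means is delta (normal approximation)."""
    std_normal = NormalDist()
    standard_error = sigma * sqrt(1 / n1 + 1 / n2)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta / standard_error
    # Probability that the statistic falls in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


# Power against several hypothetical true differences.
for delta in (0.0, 0.5, 1.0, 1.5):
    print(delta, round(approx_power(delta, sigma=1.5, n1=20, n2=20), 3))
```

When the true difference is 0 the power equals the significance level, and it rises toward 1 as the difference grows.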
- product-limit method:
-
For survival studies, the product-limit
(Kaplan-Meier) estimate
of survival is calculated by dividing time into intervals such
that each interval ends at the time of an observation, whether
censored or uncensored.
The probability of survival is calculated at the end of
each interval, with censored observations assumed to
have occurred just after uncensored ones. The product-limit
survival function is a step function that changes value
at each time point associated with an uncensored value.
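The calculation can be sketched as follows. This is a minimal illustration with hypothetical follow-up times, where an event flag of 1 marks an uncensored observation and 0 a censored one; the function name is an assumption.

```python
def product_limit(times, events):
    """Return [(time, survival estimate)] at each uncensored time.

    Observations censored at a given time are counted as still at risk
    for failures at that same time, which treats them as occurring
    just after the uncensored ones.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Gather every observation recorded at time t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures > 0:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


times = [1, 2, 3, 4, 5]   # hypothetical follow-up times
events = [1, 0, 1, 1, 0]  # 1 = failure observed, 0 = censored
for t, s in product_limit(times, events):
    print(t, round(s, 4))
```

The estimate changes only at times 1, 3, and 4, the uncensored observations, which gives the step function described above.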
- qualitative:
-
Qualitative variables are variables for which an attribute or classification
is measured. Examples of qualitative variables are gender
or disease state.
- quantitative:
-
Quantitative variables are variables for which a numeric value
representing an amount is measured.
- random effects:
-
When the populations included in an experiment are a random subset
of those of interest, then the experiment follows a random-effects design.
In an experiment using a random-effects design, the results of the experiment
apply not only to the populations included in the experiment, but
to the wider set of populations from which the subset was taken.
For example, subjects in a repeated measures
(within factors) design
are considered a random effect because we are interested not in
the particular subjects chosen for the experiment, but the entire
population of potential subjects. Similarly, blocks are
often a random effect in analysis of variance.
Multiple comparisons tests
for an analysis of variance are not applied when the effects
are random.
Whether an effect is to be considered random or fixed
may depend on the circumstances. A factory may conduct an experiment
comparing the output of several machines. If those machines are the
only ones of interest (because they constitute the entire set of
machines owned by that company), then machine will be a fixed effect.
If the machines were instead selected randomly from among those
owned by the company, then machine would be a random effect.
- random sample:
-
A random sample of size N is a collection of N objects
that are independent and
identically distributed.
In a random sample, each member
of the population
has an equal chance of becoming part
of the sample.
- random variable:
-
A random variable is a rule that assigns a value to each
possible outcome of an experiment. For example, if an
experiment involves measuring the height of people,
then each person who could be a subject of the
experiment has an associated value, his or her height.
A random variable may be discrete (the possible values
are finite or countable, as in tossing a coin) or continuous
(the values can take any possible value along a range,
as in height measurements).
- randomized block
design:
-
A randomized block analysis of variance
design such as one-way blocked ANOVA
is created by first grouping the experimental
subjects into blocks such that
the subjects in each block are as similar as possible
(e.g., littermates), and there are as many subjects in each
block as there are levels of the factor of interest,
and then randomly assigning a different level of the factor
to each member of the block, such that each level occurs
once and only once per block. The blocks are assumed not
to interact with the factor.
- rank tests:
-
Rank tests are nonparametric tests
that are calculated by replacing the data by their rank values.
Rank tests may also be applied when the only data available
are relative rankings.
Examples of rank tests include the
Wilcoxon signed rank test, the
Mann-Whitney rank sum test, the
Kruskal-Wallis test, and
Friedman's test.
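Replacing data by their ranks can be sketched as follows. Assigning tied values the average of the ranks they occupy (midranks) is a common convention and is an assumption here. The data are hypothetical.

```python
def midranks(values):
    """Return the rank of each value (1 = smallest), averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


print(midranks([10, 20, 20, 30]))
print(midranks([30, 10, 20]))
```

A rank test statistic is then computed from these ranks instead of from the original measurements.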
- repeated measures ANOVA:
-
In a repeated measures ANOVA, there will be
at least one factor that is measured at each level for every subject
in the experiment.
This is a within (repeated measures) factor.
For example, an experiment in which each subject performs the same
task twice is a repeated measures design, with trial (or trial number)
as the within factor.
If every subject performed the same task twice under each of two conditions,
for a total of 4 observations for each subject, then both trial and
condition would be within factors.
In a repeated measures design,
there may also be one or more factors that are measured at only
one level for each subject, such as gender. This type of factor
is a between or grouping factor.
- residuals:
- A residual is the difference between the observed value
of a response measurement and the value that is fitted under the
hypothesized model. For example, in a
two-sample unpaired t test,
the fitted value for a measurement is the mean of
the sample from which it came, so the residual would be
the observed value minus the sample mean.
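For the two-sample example, the residuals can be sketched as follows, using two small hypothetical samples.

```python
from statistics import mean


def residuals(sample):
    """Observed value minus the fitted value (the sample mean)."""
    fitted = mean(sample)
    return [x - fitted for x in sample]


sample_1 = [4.0, 5.0, 6.0]       # hypothetical measurements
sample_2 = [8.0, 9.0, 10.0, 13.0]
print(residuals(sample_1))
print(residuals(sample_2))
```

Within each sample the residuals sum to zero, because the fitted value is that sample's own mean.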
- resistant:
- A statistic is resistant if its value does not
change substantially when an arbitrary change,
no matter how large, is made in any small part of the data.
For example, the median is a resistant measure of
location, while the mean is not; the mean can
be drastically affected by making a single data
value arbitrarily large, whereas the median cannot.
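A short numerical check of this contrast, using hypothetical data in which one value is replaced by a very large one:

```python
from statistics import mean, median

original = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 4, 1000]  # one value made arbitrarily large

print(mean(original), median(original))
print(mean(contaminated), median(contaminated))
```

The median stays at 3 in both cases, while the mean moves from 3 to 202.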
- robust:
- Robust statistical tests are tests that operate well across a wide
variety of distributions.
A test can be robust for
validity, meaning that it provides P values close to the true ones
in the presence of (slight) departures from its
assumptions. It may also be robust for efficiency,
meaning that it maintains its statistical power (the
probability that a true violation of the
null hypothesis
will be detected by the test) in the presence of
those departures.
- scale:
- The generalized concept of the variability or dispersion of
a distribution.
Typical measures of scale are
variance, standard deviation, range, and
interquartile range.
Scale and spread
both refer to the same general concept of variability.
- shape:
- The general form of a distribution,
often characterized by its skewness
and kurtosis
(heavy or
light tails relative to
a normal distribution).
- significance
level:
- The significance level (also known as the alpha-level) of
a statistical test is the pre-selected probability of (incorrectly)
rejecting the
null hypothesis
when it is in fact true.
Usually a small value such as 0.05 is chosen.
If the P value calculated for a statistical test
is smaller than the significance level, the null hypothesis is rejected.
- skewness:
- Skewness is a lack of symmetry in a distribution.
Data from a positively skewed (skewed to the right) distribution
have values that are bunched together below the mean,
but have a long tail above the mean.
(Distributions that are forced to be positive,
such as annual income, tend to be skewed to the right.)
Data from a negatively skewed (skewed to the left) distribution have
values that are bunched together above the mean,
but have a long tail below the mean.
Boxplots and
normal probability plots
may both be useful in detecting skewness
to the right
or to the left.
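Skewness can also be summarized numerically. The sketch below uses the moment coefficient of skewness (the third central moment divided by the 1.5 power of the second), which is one common measure and is not specified in this entry. The data are hypothetical.

```python
def moment_skewness(data):
    """Third central moment divided by the 1.5 power of the second."""
    n = len(data)
    center = sum(data) / n
    m2 = sum((x - center) ** 2 for x in data) / n
    m3 = sum((x - center) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 1, 2, 2, 3, 10]       # long tail above the mean
left_skewed = [-x for x in right_skewed]  # mirror image, long tail below
symmetric = [1, 2, 3, 4, 5]

for name, data in (("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)):
    print(name, round(moment_skewness(data), 3))
```

The coefficient is positive for the right-skewed data, negative for the left-skewed data, and zero for the symmetric data.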
- spread:
- The generalized concept of the variability of
a distribution.
Typical measures of spread are
variance, standard deviation, range, and
interquartile range.
Spread and scale
both refer to the same general concept of variability.
- stratification:
-
Stratification involves dividing a sample into homogeneous subsamples
based on one or more characteristics of the population.
For example, samples may be stratified by 10-year age groups,
so that, for example, all subjects aged 20 to 29 are in the same age
stratum in each group.
Like blocking or the use of
covariates, stratification is
often used to control for variation that is not attributable
to the variables under study. Stratification can be done
on data that has already been collected, whereas blocking
is usually done by matching subjects before the data
are collected. Potential disadvantages to
stratification are that the number of subjects in a given
stratum may not be uniform across the groups being studied,
and that there may be only a small number of subjects in
a particular stratum for a particular group.
- structural zeros:
-
The process
that creates the observations
that appear in a
contingency table
may produce cells
in the contingency table in which observations
can never occur. The zero values that must
occur in these cells are structural zeroes.
For example, a contingency table of cancer incidence by sex and
type of cancer must have the value 0 in the cell
for males and ovarian cancer, but the expected
number of males with ovarian cancer will not
be 0 as long as there is at least 1 male
and 1 ovarian cancer patient among the observations.
A contingency table containing one or more
structural zeroes is an incomplete table.
Pearson's chi-square test for independence
and Fisher's exact test
are not designed for contingency tables with structural zeroes.
- survival function:
-
The survival function is a time to failure
function that gives the probability that an individual
survives (does not experience an event) past a given time.
That is, in a survival experiment where the event is death,
the value of the survival function at time T is the
probability that a subject will die at some time greater than T.
The survival function always has a value between
0 and 1 inclusive, and is nonincreasing.
The function is used to find percentiles for survival
time, and to compare the survival experience of
two or more groups.
The mortality function is simply 1 minus the
survival function.
Other names for the survival function are
survivorship function and
cumulative survival rate.
Related functions are the
hazard function,
the conditional instantaneous probability of the event (failure)
given survival up to that time;
and the death density function, which
represents the unconditional probability that the event occurs exactly at time t.
Steeper survival curves (faster drop off toward 0) suggest
larger values for the hazard or death density functions,
and shorter survival times.
The cumulative hazard function is the integral over time
of the hazard function, and is estimated as the negative logarithm
of the survival function.
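The relationships among these functions can be sketched for one simple case. The example assumes an exponential time to failure distribution with a hypothetical constant rate, chosen only because its functions have closed forms. It also uses the standard identity that the hazard is the death density divided by the survival function.

```python
from math import exp, log

rate = 0.2  # hypothetical constant failure rate per unit time


def survival(t):
    """Probability of surviving past time t."""
    return exp(-rate * t)


def death_density(t):
    """Unconditional density of failure at time t."""
    return rate * exp(-rate * t)


def hazard(t):
    """Failure rate at time t given survival up to t."""
    return death_density(t) / survival(t)


def cumulative_hazard(t):
    """Negative logarithm of the survival function."""
    return -log(survival(t))


for t in (0, 1, 5, 10):
    print(t, round(survival(t), 4), round(hazard(t), 4), round(cumulative_hazard(t), 4))
```

In this special case the hazard is constant, so the cumulative hazard grows linearly with time while the survival function decays toward 0.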
- test of independence:
-
A test of independence for a
contingency table
tests the null hypothesis
that the row classification factor
and the column classification factor
are independent.
Two such tests are
Pearson's chi-square test for independence
and Fisher's exact test.
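The statistic for Pearson's chi-square test can be sketched as follows, using the usual expected counts (row total times column total, divided by the grand total). The table is hypothetical, and the sketch stops at the statistic and its degrees of freedom without computing a P value.

```python
def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns are independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


observed = [[10, 20],
            [20, 10]]  # hypothetical 2 x 2 table
statistic, df = pearson_chi_square(observed)
print(round(statistic, 4), df)
```

The statistic would be compared with a chi-square distribution on the returned degrees of freedom.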
- time to failure distributions:
-
In survival analysis,
data are collected on the time until an event
is observed (or censoring occurs).
Often this event is associated with a failure (such as death
or cessation of function).
The probability distribution
of such times can be represented by different functions. Three of
these are: the survival function,
which represents the probability that the event (failure) has not yet occurred;
the death density function,
which is the instantaneous probability of the event (failure);
and the hazard function,
which is the instantaneous probability
of the event (failure) given that it has not yet occurred.
The cumulative hazard function is the integral over time
of the hazard function, and is estimated as the negative logarithm
of the survival function.
- transformation:
- A transformation of data values is done by applying
the same function to each data value, such as by
taking logarithms of the data.
- truncated distribution:
- A distribution is truncated if
observed values must fall within a restricted range, instead of the
expected range over all possible real values.
For example, an observation from a
normal distribution can take any real value between
-infinity and +infinity. An observation from a truncated normal distribution
might only take on values greater than 0, or less than 2.
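One simple way to draw from a truncated normal distribution is rejection sampling: draw from the untruncated normal and discard values outside the allowed range. The sketch below is an illustration with hypothetical parameters, not an efficient method; it becomes slow when the allowed range lies far out in a tail.

```python
import random


def sample_truncated_normal(mu, sigma, lower, upper, rng):
    """Draw one value from a normal(mu, sigma) restricted to (lower, upper)."""
    while True:
        x = rng.gauss(mu, sigma)
        if lower < x < upper:
            return x


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_truncated_normal(1.0, 1.0, 0.0, 2.0, rng) for _ in range(1000)]
print(min(draws) > 0.0, max(draws) < 2.0)
```

A one-sided truncation can be obtained by passing `float("inf")` or `float("-inf")` as one of the bounds.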
- two-sample problem:
- In the two-sample problem, two independent
random samples are
collected, and then the samples are used to test a hypothesis
about the populations
from which the samples came (e.g., whether the
means of the two populations are identical).
- two-way layout:
-
The two-way layout refers to a two-way classification in which there
are two factors affecting the observed
response measurements. Each possible combination of levels
from both factors is observed, usually once each. The
interaction between the two factors is
generally assumed to be 0.
The randomized block design
is one example of a two-way layout.
- violation of assumptions:
-
Statistical hypothesis tests generally make assumptions about the
population(s)
from which the data were
sampled.
For example,
many normal-theory-based tests such as the
t test and
ANOVA
assume that the data are sampled from one or more
normal distributions,
as well as that the variances of the different
populations are the same (homoscedasticity).
If test assumptions are violated, the test results may not be valid.
- Welch-Satterthwaite t test:
- The Welch-Satterthwaite t test is an alternative to the
pooled-variance
t test, and is used when the assumption that the two
populations
have equal variances seems unreasonable. It provides a t statistic that
asymptotically (that is, as the sample sizes become large) approaches
a t distribution,
allowing for an approximate t test to be calculated
when the population variances are not equal.
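The calculation can be sketched as follows, using the usual unpooled standard error and the Satterthwaite approximation for the degrees of freedom. The two samples are hypothetical, and the function name is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Return (t statistic, approximate degrees of freedom)."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1  # squared standard error of the first mean
    v2 = variance(sample2) / n2  # squared standard error of the second mean
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


group_a = [1, 2, 3]     # hypothetical sample with small variance
group_b = [2, 4, 6, 8]  # hypothetical sample with larger variance
t, df = welch_t(group_a, group_b)
print(round(t, 4), round(df, 4))
```

The approximate degrees of freedom are generally not an integer and never exceed n1 + n2 - 2, the value used by the pooled-variance t test.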
- within effects:
-
In a repeated measures ANOVA, there will be
at least one factor that is measured at each level for every subject.
This is a within (repeated measures) factor.
For example, in an experiment in which each subject performs the same
task twice, trial number is a within factor.
There may also be one or more factors that are measured at only
one level for each subject, such as gender. This type of factor
is a between or grouping factor.