PROPHET StatGuide: Descriptive statistics

Descriptive statistics are summary values that describe features of the distribution based on the data sample. These include statistics of location, statistics of scale, statistics of distributional shape (skewness and heavy-tailedness), quantiles (order statistics) and counts of the data.


Users of descriptive statistics often make implicit assumptions about the underlying distribution. When reporting a measure of location such as the mean or median, we usually think of the underlying distribution as having a single "center" or "middle," such as the center "hump" in a normal distribution. Or we may assume that the distribution is continuous. The definitions of the statistics may be perfectly valid without those assumptions, so we must be careful in interpreting the numbers.

Descriptive statistics are estimates, and are more accurate for larger sample sizes than for smaller ones.

Although descriptive statistics can provide a few pieces of information about data or their underlying distribution, they seldom give as good an overall picture of the distribution as a boxplot, histogram, normal probability plot, or other graph of the data. One or two graphs may give you a much better idea of what your data "look like" than a raft of numeric statistics. At the very least, graphs will help you interpret the descriptive statistics.

Prophet descriptive statistics include:


Location statistics:
The most common descriptive statistics are those that measure location, or central tendency--the generalized concept of the "average" value of a distribution.

The sample arithmetic mean, also known simply as the mean or average, is the sum of all the sample values divided by the sample size. It is the best estimate of the expectation (mean) of the underlying population. It is also the center of gravity of the histogram of the sample: if the histogram were constructed out of cardboard or sheet metal, the mean value would be the fulcrum point where the histogram would balance horizontally.

Because the mean is calculated from all the sample values, it makes the maximum possible use of the available data. On the other hand, it can be influenced by any extreme value; i.e., it is not resistant. In using the mean, you should always check for the presence of outliers in the sample.

One method of dealing with the problem of outlying values is to use weighting or trimming in calculating the mean.

In the usual mean calculation, all the sample values are given the same weight (1/sample size) in the sum. This can be generalized to use any collection of weights that sum to 1.

For a trimmed mean a proportion (e.g., 10%) of the data at each end of the sample is trimmed off, and then the arithmetic mean is calculated for the remaining values. The 10% trimmed mean for a sample size of 20 would be the average of the middle 16 values. This is equivalent to weighting those 16 values equally and assigning weights of 0 to the other 4 trimmed values.
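As a rough illustration of the weighting and trimming described above, the following Python sketch computes a weighted mean and a trimmed mean. The function names and sample values are hypothetical, and dropping int(proportion * n) values from each end is one simple convention, not necessarily the rule Prophet applies.

```python
def weighted_mean(values, weights):
    """Weighted mean; the weights are required to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the sorted values at each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # number of values trimmed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample of size 20: the values 1..19 plus one outlier.
sample = list(range(1, 20)) + [1000]

plain_mean = sum(sample) / len(sample)      # pulled upward by the outlier
robust_mean = trimmed_mean(sample, 0.10)    # averages the middle 16 values
equal_weights = [1 / len(sample)] * len(sample)
same_as_plain = weighted_mean(sample, equal_weights)
```

With equal weights the weighted mean reduces to the ordinary mean, while the 10% trimmed mean ignores the two smallest and two largest values, including the outlier.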

Other weighted means can be calculated by using a weighting function such as the biweight or Winsorized means. The weighting function may depend on the size of the individual values. Chapters 10 and 11 of Hoaglin et al. discuss trimming and weighting in detail.

The confidence interval for the sample mean as reported in Prophet is the half-width of the 95% confidence interval for the mean of a normal distribution, calculated as for the one-sample t test.
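A minimal sketch of the usual one-sample t calculation for this half-width follows. It assumes the caller supplies the critical value of Student's t (for example, from a printed table), since the Python standard library does not provide it. The function name and data are hypothetical, and this is not Prophet's own code.

```python
import math
import statistics


def mean_ci_half_width(values, t_crit):
    """Half-width of a t-based confidence interval for the mean.

    t_crit is the two-sided critical value of Student's t with
    len(values) - 1 degrees of freedom, supplied by the caller.
    """
    n = len(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return t_crit * std_err


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# 2.365 is the tabulated 97.5th percentile of t with 7 degrees of freedom.
half_width = mean_ci_half_width(data, t_crit=2.365)
mean = statistics.mean(data)
interval = (mean - half_width, mean + half_width)
```

The 95% interval is the sample mean plus or minus the half-width.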

The sample median is the "middle" value of the sample. There are as many sample values above the sample median as below it. If the sample size is odd (say, 2N + 1), then the median is the (N+1)st largest data value. If the sample size is even (say, 2N), then the median is defined as the average of the Nth and (N+1)st largest data values. The sample median will divide the histogram into two pieces with equal areas. The sample median is the best estimate of the median of the underlying population.

Because the median is calculated from only one or two data values, it is highly resistant, and may be preferred to the mean when dealing with skewed data. For skewed distributions, the sample mean will be further toward the direction of skew than the median: above the median for distributions skewed to the right, and below the median for distributions skewed to the left.
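The definition of the median and its resistance to extreme values can be sketched in a few lines of Python. The function name and the small right-skewed sample are hypothetical.

```python
def sample_median(values):
    """Middle value for odd sample sizes, mean of the two middle values otherwise."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


skewed = [1, 2, 3, 4, 100]            # one large value in the right tail
mean_value = sum(skewed) / len(skewed)
median_value = sample_median(skewed)
```

Here the mean is dragged toward the long right tail and lands above the median, as described for right-skewed data.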

For symmetric distributions, the mean and median will be the same, and the sample mean and sample median will be estimating the same value. Since the sample mean is the better estimator in this case, especially if the population distribution is normal, the mean is generally preferred unless there is some reason to suspect nonnormality, especially asymmetry.

The confidence interval for the sample median as reported in Prophet is the half-width of a robust 95% confidence interval for the median of a symmetric but possibly heavy-tailed distribution, as described in Chapter 12 of Hoaglin et al.

The sample geometric mean is designed for averaging ratio or proportion data. It is equivalent to taking logarithms of the sample values (i.e., transforming the sample), finding the arithmetic mean of the logs, and then retransforming back to the original scale (by taking antilogs). It can only be used when all the sample values are greater than 0.
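The transform, average, and back-transform recipe can be sketched as below. The helper takes the transformation as a parameter, so the reciprocal version (the harmonic mean covered in this section) reuses the same code. All names are hypothetical.

```python
import math


def transformed_mean(values, forward, inverse):
    """Average on a transformed scale, then map back to the original scale."""
    mean_t = sum(forward(x) for x in values) / len(values)
    return inverse(mean_t)


def geometric_mean(values):
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires all values > 0")
    return transformed_mean(values, math.log, math.exp)


def harmonic_mean(values):
    # Assumes no value is zero; a zero would raise ZeroDivisionError.
    return transformed_mean(values, lambda x: 1 / x, lambda m: 1 / m)


ratio_average = geometric_mean([1.0, 10.0, 100.0])   # close to 10
rate_average = harmonic_mean([30.0, 60.0])           # close to 40
```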

If the logarithm transformation above is replaced by the reciprocal transformation, the result is the sample harmonic mean, which is sometimes used to average rates.

The sample mode is the single most frequently occurring data value. Samples from a continuous distribution may not have any repeated data values, so the mode is generally more informative with samples from discrete distributions.

A mode looks like a hump in a graph of the frequency distribution of the sample or population. A sample (or the underlying distribution) may have more than one mode, although Prophet will only report a mode if there is a single one. If the distribution is unimodal, like the normal distribution, and also symmetric, then the sample mean, the sample mode, and the sample median are all estimates of the same value, the population mean.

The sample mode is less sensitive to skewness than either the sample mean or the sample median, but it is more subject to sample variation than either the sample mean or the sample median.
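A minimal sketch of a mode calculation that reports a value only when a single mode exists is shown below. Returning None for ties is an assumption made here to mirror the behavior described above; it is not taken from Prophet's implementation, and the function name is hypothetical.

```python
from collections import Counter


def single_mode(values):
    """Return the most frequent value, or None if several values tie for first."""
    counts = Counter(values)
    top = max(counts.values())
    modes = [value for value, count in counts.items() if count == top]
    return modes[0] if len(modes) == 1 else None


discrete = [1, 2, 2, 3]          # one clear mode
bimodal = [1, 1, 2, 2]           # two values tie, so no single mode
continuous = [1.1, 2.2, 3.3]     # no repeats, so every value ties
```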

The sample midrange is the midpoint of the sample--the average of the smallest and largest data values in the sample. Like the sample median, it uses only a small portion of the data, but can be heavily affected by outliers, even more so than the sample mean. The mean daily temperature reported in newspapers is usually in fact a midrange.
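The sensitivity of the midrange to a single extreme value can be seen in a short sketch; the sample values and function name are hypothetical.

```python
def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


clean = list(range(1, 20))        # the values 1..19
with_outlier = clean + [1000]     # same sample plus one extreme value

midrange_clean = midrange(clean)
midrange_outlier = midrange(with_outlier)
mean_outlier = sum(with_outlier) / len(with_outlier)
```

Adding one outlier moves the midrange much farther than it moves the mean.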

The letter values display includes the midrange, as well as midpoints between other quantiles. A series of such midpoints can provide information about the skewness and heavy-tailedness of the distribution, but the midrange by itself does not provide much information.

The sample sum is simply the sum of all the sample data values. It is identical to the mean multiplied by the sample size.

Scale statistics:
Scale statistics measure the variability or dispersion of the sample data, how scattered (or, conversely, clustered) the data are about the center of the distribution.

The sample variance is the average of the squared deviations of each sample value from the sample mean, except that instead of dividing the sum of the squared deviations by the sample size N, the sum is divided by N-1. This is done to make the sample variance an unbiased estimator of the population variance.

The sample standard deviation is the square root of the sample variance. This means that it has the same linear units as the original data values or a measure of central tendency, instead of the squared units of the sample variance.
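The N-1 divisor can be made explicit in a short sketch. The hand-written function is checked against Python's statistics.variance, which uses the same divisor; the data are hypothetical.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance = sample_variance(data)        # squared units
std_dev = math.sqrt(variance)           # same units as the data
library_variance = statistics.variance(data)
```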

Like the sample mean, the sample variance and sample standard deviation make use of all the available sample data, and can be heavily influenced by an extreme value, or by skewed data. Because the sample variance and standard deviation are based on squared deviations, a single aberrant value can make a huge difference in the calculated sample statistic. In using these sample statistics, you should always check for the presence of outliers in the sample.

A related statistic, the mean absolute deviation, is the mean of the absolute values of the deviations of each value from the sample mean (or, sometimes, from the sample median). Like the variance and standard deviation, it can be influenced by even a single outlier, but because the deviations are not squared, the effect is not as pronounced.
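A sketch of the mean absolute deviation follows, with the center chosen by the caller. The function name, the `center` parameter, and the data are hypothetical.

```python
import statistics


def mean_absolute_deviation(values, center="mean"):
    """Mean of |x - c|, where c is the sample mean or the sample median."""
    if center == "mean":
        c = statistics.mean(values)
    elif center == "median":
        c = statistics.median(values)
    else:
        raise ValueError("center must be 'mean' or 'median'")
    return sum(abs(x - c) for x in values) / len(values)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mad_about_mean = mean_absolute_deviation(data)
mad_about_median = mean_absolute_deviation(data, center="median")
```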

The sample standard error of the mean is the sample standard deviation divided by the square root of the sample size. It is simply the estimate of the standard deviation of the sample mean, and shares both the advantages and lack of resistance of the sample standard deviation.

The sample coefficient of variation is the sample standard deviation divided by the sample mean, sometimes multiplied by 100 to give a percentage. It measures relative variability by correcting for the magnitude of the data values, thus giving a measure that has no units. It is a biased estimator of the population coefficient of variation.

If two populations are identical except for a change of scale, then the coefficients of variation will be the same. Thus the coefficient of variation is often used to compare the variability of populations that are somehow related, but have different orders of magnitude, such as body weights of elephants vs shrews.
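The scale invariance of the coefficient of variation can be checked with a small sketch. The weights and names below are hypothetical, and the change of scale is a simple unit conversion.

```python
import statistics


def coefficient_of_variation(values, as_percent=False):
    """Sample standard deviation divided by the sample mean."""
    cv = statistics.stdev(values) / statistics.mean(values)
    return 100 * cv if as_percent else cv


weights_kg = [3.9, 4.2, 4.4, 4.8, 5.1]
weights_g = [1000 * w for w in weights_kg]   # same data in grams

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)
```

The standard deviations differ by a factor of 1000, but the two coefficients of variation agree up to rounding.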

The sample sum of squares is the sum of the squared deviations of each sample value from the sample mean, and is simply the sample variance multiplied by one less than the sample size.

The sample range is the difference between the maximum and minimum values in the sample. Like the sample midrange, it uses only a small portion of the data, but can be heavily affected by outliers. It is also not a very good estimator of the population range, since it is biased and highly variable. Its best use is in conjunction with another scale statistic like the sample standard deviation.

The sample interquartile range is the difference between the upper (75th percentile) and lower (25th percentile) quartiles of the data sample, which are the upper and lower bounds of the center half of the data values. It does not use all the available data, relying only on the central half of the data. It is less likely to be heavily affected by outliers or skewness (which mostly affects values in the tails) than either the range or the standard deviation, but is not the best estimator when the population is known to be normal or nearly so.
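The contrast between the range and the interquartile range can be sketched as follows. Several conventions exist for computing sample quartiles; this example uses the "inclusive" method of Python's statistics.quantiles (Python 3.8 or later) as an assumption, which may differ from the rule Prophet uses. The data are hypothetical.

```python
import statistics


def sample_range(values):
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 - Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # largest value replaced

range_clean = sample_range(clean)
range_outlier = sample_range(with_outlier)
iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(with_outlier)
```

Replacing the largest value with an outlier changes the range drastically but leaves the interquartile range unchanged.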

Shape and distribution statistics:
Shape statistics measure how the shape of the underlying population differs from the shape of a normal distribution with the same mean and variance. Boxplots, histograms, and normal probability plots often help in interpreting shape statistics.

The sample skewness measures asymmetry. A symmetric distribution has 0 skewness, a distribution skewed to the right (long righthand tail) has positive skewness, and a distribution skewed to the left (long lefthand tail) has negative skewness. Outliers in a sample from a symmetric distribution can produce a non-zero sample skewness statistic. A boxplot or a normal probability plot of the sample can provide information as to whether this might be the case.

The sample kurtosis measures heavy-tailedness or light-tailedness relative to the normal distribution. A light-tailed distribution like the uniform distribution has fewer values in the tails (away from the center of the distribution) than the normal distribution, and will have negative kurtosis. A heavy-tailed distribution like the Cauchy distribution has more values in the tails (away from the center of the distribution) than the normal distribution, and will have positive kurtosis. Outliers in a sample from a distribution with normal tails can produce a non-zero sample kurtosis statistic. A boxplot or a normal probability plot of the sample can provide information as to whether this might be the case.

A sample from a distribution with long tails (positive kurtosis) may also have a sizeable non-zero skewness statistic, even if the underlying distribution is symmetric. Both the sample skewness and sample kurtosis statistics make use of all the data values, and, like the mean and standard deviation, are sensitive to outliers.
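The sketch below uses the simple moment-based definitions of skewness and kurtosis, with 3 subtracted from the kurtosis so that a normal distribution scores 0. This choice of formula is an assumption; statistical packages, possibly including Prophet, often apply small-sample corrections that give slightly different numbers. The names and data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the sample."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def sample_skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def sample_excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = symmetric + [50.0]     # one extreme value on the right

skew_symmetric = sample_skewness(symmetric)
kurt_symmetric = sample_excess_kurtosis(symmetric)
skew_outlier = sample_skewness(with_outlier)
kurt_outlier = sample_excess_kurtosis(with_outlier)
```

The evenly spaced symmetric sample has zero skewness and negative kurtosis (light tails); adding a single outlier makes both statistics positive.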

The normality test gives a P value for the Shapiro-Wilk omnibus test of normality. (If the sample size is greater than 2000, Stephens' test of normality is performed.) This test detects departures from normality, but will not indicate the type of nonnormality (e.g., skewness vs heavy-tailedness).

Quantiles:
Quantiles are order statistics, or averages of two order statistics, chosen so that a certain proportion of the sorted data values fall below the quantile. The median is a quantile: it is the 50th percentile, because 50% of the data values fall below it.

The minimum and maximum sample values are also quantiles, as the 0th and 100th percentiles, respectively. They are also known as the extremes.

The letter values display is made of a specific set of quantiles, such that the proportion that falls below (or above) each quantile is a power of 1/2. The median is the first such quantile. The next such quantiles are the lower and upper quartiles. The lower quartile, Q1, is the 25th percentile. The upper quartile, Q3, is the 75th percentile. Q3-Q1 is the interquartile range. If the distance between the median and Q3 is greater than that between the median and Q1, the distribution may be skewed to the right. If the distance between the median and Q3 is less than that between the median and Q1, the distribution may be skewed to the left. The center box of a boxplot is constructed from Q1 and Q3, along with the median.
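The quartile comparison described above can be sketched as a small helper. It uses the "inclusive" quartile method of Python's statistics.quantiles (Python 3.8 or later) as an assumed convention, and the function name and samples are hypothetical. The result is only a hint about skewness, not a test.

```python
import statistics


def quartile_skew_hint(values):
    """Compare Q3 - median with median - Q1 and report the suggested skew."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper, lower = q3 - median, median - q1
    if upper > lower:
        return "right"
    if upper < lower:
        return "left"
    return "none"


stretched_right = [1, 2, 3, 4, 5, 8, 13, 21, 34]
stretched_left = [-x for x in stretched_right]   # mirror image
even_spacing = [1, 2, 3, 4, 5, 6, 7, 8, 9]
```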

Counts:
The sample size is the number of (non-empty) values in the sample.

The number missing is the number of empty (missing) values in the sample. Prophet calculates the number of missing values on a per-column basis, so that empty values are not counted as missing in a column if they occur after the row with the last non-empty value in that column.

The number of unique values is the number of different values in the sample. This is useful for checking for incorrectly entered values in a sample from a discrete distribution, or as a very crude indication of clumpiness in a sample from a continuous distribution.
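The three counts can be sketched for a single column as follows. Representing empty cells as None and the function name are assumptions made for illustration; the rule that empty cells after the last non-empty value are not counted as missing follows the description above.

```python
def column_counts(column):
    """Sample size, number missing, and number of unique values for one column.

    Empty cells are represented as None. Empty cells that come after the
    last non-empty value are ignored rather than counted as missing.
    """
    last = max((i for i, v in enumerate(column) if v is not None), default=-1)
    used = column[:last + 1]
    present = [v for v in used if v is not None]
    return {
        "size": len(present),
        "missing": len(used) - len(present),
        "unique": len(set(present)),
    }


# One interior gap and two trailing empty cells.
counts = column_counts([3, None, 5, 3, None, None])
```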

If you are not familiar with descriptive statistics, you are advised to consult with a statistician. Failure to understand descriptive statistics may result in drawing erroneous conclusions from your data. You may also want to consult the following references:


Last modified: February 20, 1997

©1996 BBN Corporation All rights reserved.