# PROPHET StatGuide: Do your data violate multiple linear regression assumptions?

If the X or Y populations from which data to be analyzed by multiple linear regression were sampled violate one or more of the multiple linear regression assumptions, the results of the analysis may be incorrect or misleading. For example, if the assumption of independence is violated, then multiple linear regression is not appropriate. If the assumption of normality is violated, or outliers are present, then the multiple linear regression goodness of fit test may not be the most powerful or informative test available, and this could mean the difference between detecting a linear fit or not. A non-least-squares, robust, or resistant regression method, a transformation, a weighted least squares linear regression, or a nonlinear model may result in a better fit. If the population variance for Y is not constant, a weighted least squares linear regression or a transformation of Y may provide a means of fitting a regression adjusted for the inequality of the variances. If fitted coefficients are unstable because of multicollinearity in the X variables, then a method designed to deal with multicollinearity may provide a more useful fit.

Often, the impact of an assumption violation on the multiple linear regression result depends on the extent of the violation (such as how inconstant the variance of Y is, or how skewed the Y population distribution is). Some small violations may have little practical effect on the analysis, while other violations may render the multiple linear regression result uselessly incorrect or uninterpretable.

#### Potential assumption violations include:

Implicit independent variables (covariates):
Apparent lack of independence in the fitted Y values may be caused by the existence of an implicit X variable in the data, an X variable that was not explicitly used in the linear model. In this case, the best model may still be linear, but may not include all the original X variables. If there is a linear trend in the plot of the regression residuals against the fitted values, then an implicit X variable may be the cause. A plot of the residuals against the prospective new X variable should reveal whether there is a systematic variation; if there is, you may consider adding the new X variable to the linear model.

A "new" X variable might be derived from one or more X variables already in the equation, such as using the square of X1 along with X1 to handle curvature in X1, or adding X1*X2 as a new variable to handle interaction between X1 and X2.

If an implicit X variable is not included in the fitted model, the fitted estimates for the coefficients may be biased, and not very meaningful, and the fitted Y values may not be accurate.
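As a rough illustration of the residual check described above, the sketch below fits Y on X1 alone and then correlates the residuals with a candidate variable that was left out. The data are synthetic, the helper name `fit_ols` is hypothetical, and NumPy is assumed; none of these come from the guide itself.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares fit with an intercept; returns coefficients and residuals."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, y - A @ coef

# Synthetic data (made up for illustration): Y depends on both X1 and X2.
x1 = np.arange(20, dtype=float)
x2 = np.sin(x1)                      # candidate variable omitted from the first fit
y = 1.0 + 2.0 * x1 + 3.0 * x2

_, resid_short = fit_ols(x1, y)                          # X2 left out
coef_full, resid_full = fit_ols(np.column_stack([x1, x2]), y)

# A strong correlation between the residuals and the omitted variable
# suggests that adding it to the model is worth considering.
r_omitted = np.corrcoef(resid_short, x2)[0, 1]
print(round(r_omitted, 3), np.round(coef_full, 3))
```

A derived variable such as the square of X1 or the product X1*X2 could be checked the same way, by passing it as the candidate in place of `x2`.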

Another possible cause of apparent dependence between the Y observations is the presence of an implicit block effect. (The block effect can be considered another type of implicit X variable, albeit a discrete one.) If a blocking variable is suspected, an analysis of covariance can be performed, essentially dividing the data into different regression equations based on the value of the blocking variable.

If multiple values of Y are collected at the same values of X, this can act as another type of blocking, with the unique combinations of values of the Xs acting as blocks. These multiple Y measurements may be less variable than the overall variation in Y, and, given their common values of the Xs, they are not truly independent of each other. If there are many replicated X values, and if the variation between Y at replicated values is much smaller than the overall residual variance, then the variance of the estimate of the coefficients may be too small, making the test of whether they are 0 (and the test of the goodness of the overall fit) anticonservative (more likely than the stated significance level to reject the null hypothesis, even when it is true). In this case, an alternative method is to replace each replicated unique combination of X values by a single data point with the average Y value, and then perform the regression analysis with the new data set. A possible drawback to this method is that by reducing the number of data points, the degrees of freedom associated with the residual error are reduced, thus potentially reducing the power of the test.
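The averaging step described above might look like the following sketch. The function name `collapse_replicates` and the small data set are hypothetical, and NumPy is assumed.

```python
import numpy as np

def collapse_replicates(X, y):
    """Replace rows that share identical X values by one row with the mean Y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    uniq, inverse = np.unique(X, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard against shape differences across NumPy versions
    sums = np.bincount(inverse, weights=y, minlength=len(uniq))
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts

# Six observations, but only three distinct combinations of (X1, X2).
X = [[1, 0], [1, 0], [2, 1], [2, 1], [2, 1], [3, 0]]
y = [1.0, 3.0, 4.0, 5.0, 6.0, 7.0]
X_mean, y_mean = collapse_replicates(X, y)
print(X_mean)
print(y_mean)
```

The regression would then be run on `X_mean` and `y_mean`, with correspondingly fewer residual degrees of freedom.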

Lack of independence in Y:
Whether the Y values are independent of each other is generally determined by the structure of the experiment from which they arise. Y values collected over time may be serially correlated (here time is the implicit factor). If the data are in a particular order, consider the possibility of dependence. (If the row order of the data reflects the order in which the data were collected, an index plot of the data [data value plotted against row number] can reveal patterns in the plot that could suggest possible time effects.)

For serially correlated error terms, the estimates of the coefficients will be unbiased, but the estimates of their variances will not be reliable. If they are positively serially correlated, the estimate of residual variance and the estimates of the variances of the coefficients may all be too small, making the tests and confidence intervals that involve them unreliable. This kind of serial correlation may appear when there are one or more implicit X variables.
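A numeric companion to the index plot is to measure how strongly each residual resembles its predecessor. The sketch below computes the lag-1 autocorrelation and the Durbin-Watson statistic, a widely used diagnostic that this guide does not itself name. It assumes the residuals are supplied in collection order; the two residual sequences are made up, and NumPy is assumed.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def lag1_autocorr(resid):
    """Correlation-like measure between each residual and the one before it."""
    r = np.asarray(resid, dtype=float)
    r = r - r.mean()
    return np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)

slow_drift = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # neighbours tend to agree
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbours tend to disagree

for name, resid in [("slow_drift", slow_drift), ("alternating", alternating)]:
    print(name, durbin_watson(resid), lag1_autocorr(resid))
```

Durbin-Watson values near 2 are consistent with no first-order serial correlation, values well below 2 point toward positive serial correlation, and values well above 2 toward negative serial correlation.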

If you are unsure whether your Y values are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.

Multicollinearity:
If two or more of the X variables are nearly linear combinations of each other, the X variables are multicollinear. In this situation, you may be able to find a good multiple linear fit for Y, but the values of the individual coefficients may be highly variable. Thus, you might be able to predict Y with reasonable accuracy, but you would not be able to draw any reliable conclusions about the coefficients. And the fitted coefficients can vary widely from sample to sample of data, or if a single X variable is added or deleted from the equation. Various formal and informal diagnostics may help detect multicollinearity. There are also some methods designed to deal with multicollinearity.

In cases of severe multicollinearity, it may not be possible to calculate some of the diagnostic measures of influence or leverage, or even to perform the fit itself. In such cases, the data are said to be ill-conditioned.
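One formal diagnostic of the kind mentioned above is the variance inflation factor (VIF), which regresses each X variable on the others and reports 1 / (1 - R²). The guide does not say which diagnostics it has in mind, so treat this as one possible choice. The data and the function name `vif` are hypothetical, and NumPy is assumed.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (intercept added internally).

    Assumes no column is an exact linear combination of the others;
    in that case R^2 would equal 1 and the division would fail.
    """
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Made-up data: X2 is almost a copy of X1, X3 is unrelated.
x1 = np.arange(10, dtype=float)
x2 = x1 + 0.1 * np.array([1, -1] * 5, dtype=float)
x3 = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

A frequently quoted rule of thumb, not taken from this guide, treats a VIF above about 10 as a sign of troublesome multicollinearity.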

Outliers:
Values may not be identically distributed because of the presence of outliers. Outliers are anomalous values in the data. Outliers may have a strong influence over the fitted coefficients, giving a poor fit to the bulk of the data observations. Outliers tend to increase the estimate of residual variance, lowering the chance of rejecting the null hypothesis. They may be due to recording errors, which may be correctable, or they may be due to the Y values not all being sampled from the same population. Apparent outliers may also be due to the Y values being from the same, but nonnormal, population. Outliers may show up clearly in a scatterplot of Y and one of the X variables, as points that do not lie near the general trend of the data. However, a point may be an unusual value in either X or Y without necessarily being an outlier in the scatterplot.

Once the regression line has been fitted, the boxplot and normal probability plot (normal Q-Q plot) for residuals may suggest the presence of outliers in the data. After the fit, outliers are usually detected by examining the residuals or the high-leverage points.

The method of least squares involves minimizing the sum of the squared vertical distances between each data point and the fitted line. Because of this, the fitted line can be highly sensitive to outliers. (In other words, least squares regression is not resistant to outliers, and thus, neither are the fitted coefficient estimates.) An outlier may act as a high-leverage point, distorting the fitted equation and perhaps fitting the main body of the data poorly.
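Leverage can be quantified as the diagonal of the hat matrix, which depends only on the X values. The following sketch computes it from a QR decomposition. The data are invented so that the last point is unusual in X1, the name `leverages` is hypothetical, and NumPy is assumed.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a model with an intercept."""
    A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    Q, _ = np.linalg.qr(A)          # reduced QR; the hat matrix equals Q @ Q.T
    return np.sum(Q ** 2, axis=1)

# Made-up X values: the last observation lies far from the others in X1.
x1 = [1, 2, 3, 4, 5, 6, 7, 20]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
h = leverages(np.column_stack([x1, x2]))

n_coef = 3                           # intercept plus two X variables
cutoff = 2 * n_coef / len(h)         # a common informal cutoff, not from this guide
print(np.round(h, 3), cutoff)
```

The leverages always sum to the number of fitted coefficients, so a single value close to 1 means that one observation largely determines its own fitted value.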

If you find outliers in your data that are not due to correctable errors, you may wish to consult a statistician as to how to proceed.

Nonnormality:
The values in a sample may indeed be from the same population, but not from a normal one. Signs of nonnormality are skewness (lack of symmetry) or light-tailedness or heavy-tailedness. The boxplot, histogram, and normal probability plot (normal Q-Q plot), along with the normality test, can provide information on the normality of the population distribution. However, if there are only a small number of data points, nonnormality can be hard to detect. If there are a great many data points, the normality test may detect statistically significant but trivial departures from normality that will have no real effect on the multiple linear regression's tests (since, for example, the t statistic for the test of a coefficient will converge in distribution to the standard normal distribution, by the central limit theorem).

For data from a normal distribution, normal probability plots should approximate straight lines, and boxplots should be symmetric (median and mean together, in the middle of the box) with no outliers. Except for substantial nonnormality that leads to outliers in the X-Y data, if the number of data points is not too small, then the multiple linear regression statistic will not be much affected even if the population distributions are skewed.

Robust statistical tests operate well across a wide variety of distributions. A test can be robust for validity, meaning that it provides P values close to the true ones in the presence of (slight) departures from its assumptions. It may also be robust for efficiency, meaning that it maintains its statistical power (the probability that a true violation of the null hypothesis will be detected by the test) in the presence of those departures. Linear regression is fairly robust for validity against nonnormality, but it may not be the most powerful test available for a given nonnormal distribution, although it is the most powerful test available when its test assumptions are met. In the case of nonnormality, a non-least-squares regression method, or employing a transformation of one or more X variables may result in a more powerful test.
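The skewness and tail weight mentioned above can be summarized numerically as a supplement to the plots. This minimal sketch uses simple moment-based estimates; the function name and both samples are hypothetical, and NumPy is assumed.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both 0 for a normal population)."""
    v = np.asarray(values, dtype=float)
    d = v - v.mean()
    m2 = np.mean(d ** 2)
    skew = np.mean(d ** 3) / m2 ** 1.5
    kurt = np.mean(d ** 4) / m2 ** 2 - 3.0
    return skew, kurt

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]
print(skew_and_excess_kurtosis(symmetric))
print(skew_and_excess_kurtosis(right_skewed))
```

Positive skewness indicates a longer right tail, and positive excess kurtosis indicates heavier tails than the normal distribution. With few data points these estimates are themselves very noisy, which is the same difficulty the text describes for the plots and the normality test.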

Variance of Y not constant:
If the variance of Y is not constant, then the error variance will not be constant. The most common form of such heteroscedasticity in Y is that the variance of Y may increase as the mean of Y increases, for data with positive X and Y.

Unless the heteroscedasticity of the Y is pronounced, its effect will not be severe: the least squares estimates will still be unbiased, and the estimates of the coefficients will either be normally distributed if the errors are normally distributed, or at least normally distributed asymptotically (as the number of data points becomes large) if the errors are not normally distributed. The estimate for the variance of the coefficients will be inaccurate, but the inaccuracy is not likely to be substantial if the X values are symmetric about their means.

Heteroscedasticity of Y is usually detected informally by examining the X-Y scatterplots of the data before performing the regression. If both nonlinearity and unequal variances are present, employing a transformation of Y may have the effect of simultaneously improving the linearity and promoting equality of the variances. Otherwise, a weighted least squares multiple linear regression may be the preferred method of dealing with nonconstant variance of Y.
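Weighted least squares can be carried out by scaling each row of the design matrix and each Y value by the square root of its weight, then running ordinary least squares. The sketch below does this with NumPy. The weights 1/x² reflect an assumption made purely for illustration, namely that the standard deviation of Y is proportional to X. The data and the function name are hypothetical.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Minimise sum(w_i * residual_i**2); returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    sw = np.sqrt(np.asarray(weights, dtype=float))
    coef, *_ = np.linalg.lstsq(A * sw[:, None], np.asarray(y, dtype=float) * sw, rcond=None)
    return coef

# Made-up data whose scatter about the line grows with x.
x = np.arange(1.0, 9.0)
signs = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
y = 2.0 + 0.5 * x + 0.2 * x * signs

weights = 1.0 / x ** 2               # assumption: SD of Y proportional to x
coef_wls = weighted_least_squares(x, y, weights)
coef_ols = weighted_least_squares(x, y, np.ones_like(x))   # equal weights = ordinary fit
print(np.round(coef_wls, 3), np.round(coef_ols, 3))
```

In practice the weights have to come from knowledge of the measurement process or from an estimate of how the variance changes, which is a modelling decision in its own right.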

The correct model is not linear:
If the linear model is not the correct one for the data, then the coefficient estimates and the fitted values from the multiple linear regression will be biased, and the fitted coefficient estimates will not be meaningful. Over a restricted range of X or Y, nonlinear models may be well approximated by linear models (this is in fact the basis of linear interpolation), but for accurate prediction a model appropriate to the data should be selected. An examination of the X-Y scatterplots may reveal whether the linear model is appropriate. If there is a great deal of variation in Y, it may be difficult to decide what the appropriate model is; in this case, the linear model may do as well as any other, and has the virtue of simplicity.

One or more X variables are random, not fixed:
The usual multiple linear regression model assumes that the observed X variables are fixed, not random. If the X values are not under the control of the experimenter (i.e., are observed but not set), and if there is in fact underlying variance in the X variables, but they have the same variance, the linear model is called the errors-in-variables model or the structural model. The least squares fit will still give the best linear predictor of Y, but the estimates of the coefficients will be biased.
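The bias can be seen in a small simulation. The sketch below covers only the simplest case, a single X observed with independent measurement error, where the fitted slope is pulled toward zero. All numbers (true slope 2, error variances, seed) are arbitrary choices for illustration, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

true_x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * true_x + rng.normal(0.0, 0.5, n)
observed_x = true_x + rng.normal(0.0, 1.0, n)   # X recorded with measurement error

def slope(x, y):
    """Least squares slope for a single X variable."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_clean = slope(true_x, y)       # X known exactly
slope_noisy = slope(observed_x, y)   # X observed with error
print(round(slope_clean, 3), round(slope_noisy, 3))
```

With equal variances for the true X and its measurement error, as simulated here, the slope fitted to the observed X settles near half the true value, even though the fitted line remains the best linear predictor of Y from the observed X.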

Patterns in plot of data:
If the assumption of the linear model is correct, the plot of the observed Y values against X should suggest a linear band across the graph. Outliers may appear as anomalous points in the graph, often in the upper righthand or lower lefthand corner of the graph. (A point may be an outlier in either X or Y without necessarily being far from the general trend of the data.)

If the linear model is not correct, the shape of the general trend of the X-Y plot may suggest the appropriate function to fit (e.g., a polynomial, exponential, or logistic function). Alternatively, the plot may suggest a reasonable transformation to apply. For example, if the X-Y plot arcs from lower left to upper right so that data points either very low or very high in X lie below the straight line suggested by the data, while the data points with middling X values lie on or above that straight line, taking square roots or logarithms of the X values may promote linearity.
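The effect of such a transformation can be checked by comparing how well a straight line fits before and after it is applied. In this sketch the data are constructed to follow a logarithmic curve exactly, so the comparison is deliberately clear-cut; the helper `r_squared` is hypothetical and NumPy is assumed.

```python
import numpy as np

def r_squared(x, y):
    """Squared correlation, which equals R^2 for a straight-line fit of y on x."""
    return np.corrcoef(x, y)[0, 1] ** 2

# Made-up data that arc from lower left to upper right.
x = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)
y = 3.0 + 2.0 * np.log(x)

r2_raw = r_squared(x, y)             # straight line in X
r2_log = r_squared(np.log(x), y)     # straight line in log(X)
print(round(r2_raw, 3), round(r2_log, 3))
```

Real data will not separate this cleanly, and a higher R² after transformation is only a hint; the residual plots for the transformed fit still need to be examined.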

If the assumption of equal variances for the Y is correct, the plot of the observed Y values against X should suggest a band across the graph with roughly equal vertical width for all values of X. (That is, the shape of the graph should suggest a tilted cigar and not a wedge or a megaphone.)

A fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance in the values increases in the direction the fan pattern widens (usually as the sample mean increases), and this in turn suggests that a transformation of the Y values may be needed.

Unfortunately, simple X-Y plots may not be as useful in multiple regression as they are for simple linear regression. If there is multicollinearity, then that can cause the plots of Y against individual X values to be misleading. For example, the apparent increase in variance for Y as X1 increases might be due to the effect of other X variables on Y.
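Because plots of Y against a single X can mislead in multiple regression, the cigar-versus-megaphone comparison can also be applied to the residuals against the fitted values. The sketch below is an informal numeric companion to that plot, not a formal test: it compares the residual spread in the upper and lower halves of the fitted values. The function name and both residual patterns are made up, and NumPy is assumed.

```python
import numpy as np

def spread_ratio(fitted, resid):
    """Residual SD in the upper half of fitted values divided by that in the lower half."""
    order = np.argsort(np.asarray(fitted, dtype=float))
    resid = np.asarray(resid, dtype=float)[order]
    half = len(resid) // 2
    return np.std(resid[-half:]) / np.std(resid[:half])

fitted = np.arange(1.0, 11.0)
signs = np.array([1, -1] * 5, dtype=float)

band_resid = 0.5 * signs             # equal vertical width everywhere
fan_resid = 0.1 * fitted * signs     # width grows with the fitted value

ratio_band = spread_ratio(fitted, band_resid)
ratio_fan = spread_ratio(fitted, fan_resid)
print(round(ratio_band, 3), round(ratio_fan, 3))
```

A ratio near 1 is consistent with a band of constant width, while a ratio well above or below 1 points toward a flare to the right or to the left.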

Special problems with few data points:
If the number of data points is small, it may be difficult to detect assumption violations. With small samples, assumption violations such as nonnormality or heteroscedasticity are difficult to detect even when they are present. With a small number of data points, multiple linear regression offers less protection against violation of assumptions. With few data points, it may be hard to determine how well the fitted equation matches the data, or whether a nonlinear function would be more appropriate.

If the ratio of the total number of coefficients (including the intercept) to the total number of data points is greater than 0.4, it will often be difficult to fit a reliable model. Many of the individual data points may become influential points, because there is so little information (data) available for each coefficient to be fitted.

A rule of thumb is to aim to have the number of data points be at least 6 times, and ideally at least 10 times, the number of X variables.
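The two guidelines above are easy to check before fitting. The helper below is a hypothetical convenience function, not part of any library, that applies the 0.4 ratio and the 6-times and 10-times rules exactly as stated.

```python
def sample_size_check(n_points, n_x_vars):
    """Apply the guide's rules of thumb to a planned regression."""
    n_coef = n_x_vars + 1            # coefficients including the intercept
    ratio = n_coef / n_points
    return {
        "coef_to_points_ratio": ratio,
        "ratio_exceeds_0_4": ratio > 0.4,
        "meets_6x": n_points >= 6 * n_x_vars,
        "meets_10x": n_points >= 10 * n_x_vars,
    }

print(sample_size_check(n_points=10, n_x_vars=4))
print(sample_size_check(n_points=40, n_x_vars=4))
```

These are screening rules only; a design that passes them can still lack power for the coefficients of interest.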

Even if none of the test assumptions are violated, a linear regression on a small number of data points may not have sufficient power to detect a significant difference between a coefficient and 0, even if the coefficient is non-zero. The power depends on the residual error, the observed variation in X, the selected significance (alpha-) level of the test, and the number of data points. Power decreases as the residual variance increases, decreases as the significance level is decreased (i.e., as the test is made more stringent), increases as the variation in observed X increases, and increases as the number of data points increases. If a statistical significance test with a small number of data values produces a surprisingly non-significant P value, then lack of power may be the reason. The best time to avoid such problems is in the design stage of an experiment, when appropriate minimum sample sizes can be determined, perhaps in consultation with a statistician, before data collection begins.
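The dependence of power on residual error, spread in X, and sample size can be traced through the standard error of a coefficient, since a smaller standard error gives a larger expected t statistic for the same nonzero coefficient. The sketch below evaluates the usual least squares formula, the square roots of the diagonal of σ²(XᵀX)⁻¹, for a few hypothetical designs. It treats the residual standard deviation `sigma` as known, which is a simplification, and it assumes NumPy.

```python
import numpy as np

def coef_standard_errors(X, sigma):
    """Standard errors of [intercept, slopes...] for a known residual SD."""
    A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    cov = sigma ** 2 * np.linalg.inv(A.T @ A)
    return np.sqrt(np.diag(cov))

x = np.arange(10, dtype=float)

base = coef_standard_errors(x, sigma=1.0)[1]                     # reference design
noisier = coef_standard_errors(x, sigma=2.0)[1]                  # larger residual error
wider_x = coef_standard_errors(2.0 * x, sigma=1.0)[1]            # more variation in X
more_points = coef_standard_errors(np.tile(x, 2), sigma=1.0)[1]  # twice as many points

print(round(base, 4), round(noisier, 4), round(wider_x, 4), round(more_points, 4))
```

Doubling the residual standard deviation doubles the slope's standard error, doubling the spread of X halves it, and doubling the number of points at the same X values divides it by the square root of 2, in line with the directions of change described above.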