A wedge-shaped fan pattern, like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance in the values increases in the direction the fan pattern widens (usually as X increases). This in turn suggests that a transformation of the X or Y values, or a weighted least squares linear regression, may be appropriate.
Points that are far from the others may be outliers in the data, or may suggest a nonnormal population distribution for Y. If an outlier is a high-leverage point, it may pull the fitted function toward it and perhaps away from the main body of the data, and may not appear as an outlier in the plot of fitted Y against X. Alternatively, a high-leverage point may make other points appear to be outliers by drawing the fitted function toward itself.
You may be able to gain additional insight from examining plots of the observed Y values against individual X variables before you perform the regression. The plots below illustrate four different scenarios for plots of the observed Y against individual X:
1. A linear relationship between X and Y seems reasonable.
2. The points seem to follow a curve, not a straight line; a linear relationship between X and Y does not appear to be appropriate for these data. A transformation may create a data set for which a linear fit is appropriate, or a nonlinear model may provide a better fit.
3. The majority of the points seem to follow a linear trend, but there is an outlier that may pull the fitted equation away from the majority of the data points, so that it does not fit them well. An alternative regression method may provide a better fit. The X and Y values of the outlying data point should also be double-checked, in case a recording error has been made.
4. The majority of the points lie on a vertical straight line, and only the presence of an outlier has created any variation in X. This situation may cause the fitted equation to go through the one outlier, so that the outlier will not turn up as a large residual.
These examples demonstrate the importance of examining plots of the data whenever a regression is to be done.
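Scenario 4 can be reproduced numerically. The following is a minimal sketch using NumPy with made-up data (the values of x and y are assumptions chosen for illustration, not data from this guide): five observations share the same X, and a single outlier supplies all the variation in X.

```python
import numpy as np

# Hypothetical data: five observations share X = 2, one outlier has X = 8.
x = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])

# Design matrix with an intercept column, fitted by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print("intercept, slope:", beta)
print("residuals:", np.round(residuals, 6))
```

The fitted line passes through the mean of the five clustered points and exactly through the outlier, so the outlier's residual is zero even though it alone determines the slope.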
An observation with leverage greater than 2p/n, where p is the number of coefficients (including the intercept) and n is the number of observations, is a high-leverage point, and is likely to be an outlier in X. (The average value of leverage is p/n.) Other potential signs of high leverage are an observation whose leverage value is much greater than all the others, or a leverage greater than 0.5.
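As an illustration of the 2p/n rule, the sketch below computes leverages as the diagonal of the hat matrix for a small made-up data set (the x values are assumptions). It uses a QR factorization of the design matrix, which is one common way to obtain the leverages; other routes give the same values.

```python
import numpy as np

# Hypothetical predictor values; the last one lies far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
X = np.column_stack([np.ones_like(x), x])  # intercept + one X variable
n, p = X.shape

# Leverages are the diagonal of the hat matrix H = X (X'X)^-1 X'.
# With the reduced QR factorization X = QR, H = QQ', so the leverage of
# observation i is the squared norm of row i of Q.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

cutoff = 2 * p / n
flagged = np.flatnonzero(leverage > cutoff)

print("leverages:", np.round(leverage, 3))
print("mean leverage (equals p/n):", leverage.mean())
print("indices above 2p/n:", flagged)
```

Only the observation at X = 20 exceeds the cutoff here, and the leverages sum to p, which is why their average is p/n.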
Because points with high leverage pull the fitted equation toward them, they may have small residuals, and thus not stand out in a plot of residuals against fitted values. A raw residual can be adjusted for the leverage for the corresponding observation in various ways, producing internally studentized residuals, deleted residuals, and externally studentized residuals, also known as studentized deleted residuals. In each case, points with high leverage will tend to have larger adjusted residuals than raw residuals. An observation with a studentized deleted residual greater than 2 in absolute value is likely to be an outlier in Y.
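The three adjusted residuals can be written in terms of the raw residual e, the leverage h, and the residual variance. The sketch below uses the usual textbook formulas on made-up data (the x and y values are assumptions); the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: roughly y = 2x, with one unusual Y value at index 5.
x = np.arange(1.0, 9.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 18.0, 14.2, 15.9])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                       # raw residuals
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)               # leverages

sse = e @ e
s2 = sse / (n - p)                     # residual variance, all points

# Internally studentized: scaled by the variance from the full fit.
internal = e / np.sqrt(s2 * (1 - h))

# Deleted residual: observed Y minus the prediction from a fit without point i.
deleted = e / (1 - h)

# Externally studentized (studentized deleted): scaled by the variance
# estimated with point i left out.
s2_del = (sse - e**2 / (1 - h)) / (n - p - 1)
external = e / np.sqrt(s2_del * (1 - h))

suspect = np.flatnonzero(np.abs(external) > 2)
print("studentized deleted residuals:", np.round(external, 2))
print("likely outliers in Y:", suspect)
```

The shortcut formulas avoid refitting the model n times; refitting without each point in turn gives the same deleted residuals.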
In cases of severe multicollinearity, it may not be possible to calculate some of the diagnostic measures of leverage or influence. These diagnostics also are not calculated if the fit is exact.
DFFITS measures how much the fitted value of Y changes when the ith point is removed from the data set. Large absolute values of DFFITS (greater than 1 for smaller data sets, or greater than twice the square root of p/n for larger ones, where p is the number of coefficients including the intercept and n is the number of data points) suggest that the corresponding data point is influential.
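To make the definition concrete, the sketch below computes DFFITS by literally refitting the model without each point. The function name dffits and the data are assumptions made for illustration; the scaling by the deleted standard deviation and the square root of the leverage follows the usual textbook definition.

```python
import numpy as np

def dffits(X, y):
    """Scaled change in the ith fitted value when point i is deleted."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    h = np.sum(np.linalg.qr(X)[0] ** 2, axis=1)      # leverages
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        r_i = y[keep] - X[keep] @ b_i
        s_i = np.sqrt(r_i @ r_i / (n - 1 - p))       # sd without point i
        out[i] = (X[i] @ beta - X[i] @ b_i) / (s_i * np.sqrt(h[i]))
    return out

# Hypothetical data: y is close to x except for the last, high-leverage point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 7.1, 10.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

d = dffits(X, y)
cutoff = 2 * np.sqrt(p / n)
flagged = np.flatnonzero(np.abs(d) > cutoff)
print("DFFITS:", np.round(d, 2))
print("influential by the 2*sqrt(p/n) rule:", flagged)
```

The same values can be obtained without refitting, as the studentized deleted residual multiplied by the square root of h/(1-h).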
Cook's distance measures the combined influence of the ith point on all the regression coefficients. It takes on greater values for data points with large residuals, large leverage values, or both.
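A common closed form for Cook's distance combines the squared residual and the leverage, which makes the statement above easy to see. The sketch below uses that form on made-up data (the x and y values are assumptions); no particular cutoff is implied.

```python
import numpy as np

# Hypothetical data: y is close to x except for the last, high-leverage point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 7.1, 10.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                               # raw residuals
h = np.sum(np.linalg.qr(X)[0] ** 2, axis=1)    # leverages
s2 = e @ e / (n - p)                           # residual variance

# Cook's distance: grows with the squared residual and with h / (1 - h)^2.
cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)

print("Cook's distance:", np.round(cooks_d, 3))
print("largest at index:", int(np.argmax(cooks_d)))
```

This is equivalent to measuring how far the whole coefficient vector moves when the ith point is deleted, scaled by X'X and by p times the residual variance.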
COVRATIO measures the change in the variance-covariance matrix with and without the ith point. It takes on greater values for data points with large leverage values, and tends to be small when the studentized deleted residual is large.
DFBETAS measures the influence of a data point on a particular coefficient. A large absolute value of DFBETAS(i,j) (greater than 1 for smaller data sets, or greater than twice the square root of 1/n for larger ones, where n is the number of data points) suggests that the ith point influences the jth coefficient.
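The two tendencies described for COVRATIO can be seen in a small example. The sketch below takes COVRATIO to be the ratio of the determinants of the estimated coefficient covariance matrices without and with the ith point, which is the usual definition; the function name covratio and the data are assumptions.

```python
import numpy as np

def covratio(X, y):
    """det(cov of coefficients without point i) / det(cov with all points)."""
    n, p = X.shape

    def cov_det(Xs, ys):
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        r = ys - Xs @ b
        s2 = r @ r / (len(ys) - p)
        return np.linalg.det(s2 * np.linalg.inv(Xs.T @ Xs))

    full = cov_det(X, y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        out[i] = cov_det(X[keep], y[keep]) / full
    return out

# Hypothetical data: index 7 has high leverage but lies on the trend;
# index 3 has ordinary leverage but an unusual Y value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([1.2, 1.9, 3.2, 9.0, 5.1, 6.0, 7.1, 20.1])
X = np.column_stack([np.ones_like(x), x])

cr = covratio(X, y)
print("COVRATIO:", np.round(cr, 4))
```

In this example the high-leverage point on the trend has a COVRATIO well above 1, while the point with the large residual has a COVRATIO close to 0.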
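DFBETAS can be sketched the same way, by refitting without each point and scaling the change in each coefficient. The function name dfbetas and the data below are assumptions; the scaling uses the deleted standard deviation and the diagonal of the inverse of X'X, as in the usual definition.

```python
import numpy as np

def dfbetas(X, y):
    """Row i, column j: scaled change in coefficient j when point i is deleted."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    c = np.diag(np.linalg.inv(X.T @ X))
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        r_i = y[keep] - X[keep] @ b_i
        s_i = np.sqrt(r_i @ r_i / (n - 1 - p))   # sd without point i
        out[i] = (beta - b_i) / (s_i * np.sqrt(c))
    return out

# Hypothetical data: y is close to x except for the last, high-leverage point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 7.1, 10.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

dfb = dfbetas(X, y)
cutoff = 2 / np.sqrt(n)
print("DFBETAS (intercept, slope):")
print(np.round(dfb, 2))
print("points influencing the slope:", np.flatnonzero(np.abs(dfb[:, 1]) > cutoff))
```

Each column corresponds to one coefficient, so a point can be influential for the slope without being influential for the intercept, or the reverse.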
If two or more influential points are near each other, then each may mask the effect of deleting the other(s), and then none of them may have a large value for these influence measures. You may be able to spot such clumps of points in graphs of Y, fitted Y, or residuals vs individual X.
A failure of the test for fit to reject the null hypothesis of zero coefficients may also happen when the linear model is not appropriate. Conversely, a significant test result does not necessarily mean that the linear model is the correct one, only that fitting a multiple linear function provides a better estimate of Y than simply using the mean of Y.
The R-square statistic and the multiple correlation coefficient are descriptive measures of how strong the linear association is between the observed and fitted Y values, but they are not tests of goodness of fit per se. Other measures of fit, such as the adjusted R-square and the Akaike information criterion (AIC), are designed to take into account the number of X variables in the model. Because R-square can never decrease as new X variables are added, the adjusted R-square or AIC may give a better idea of how the strength of the association between the observed and fitted Y values has changed as X variables are added to or deleted from the model. The adjusted R-square may in fact decrease if a new X variable does not substantially increase the amount of variation in Y explained by the X variables.
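The behavior of these measures can be checked by adding an irrelevant X variable to a model. The sketch below uses simulated data (the seed, sample size, and coefficients are assumptions), and the AIC is written in one common least-squares form, n*log(SSE/n) + 2p, which omits an additive constant; other software may define it differently.

```python
import numpy as np

def fit_summary(X, y):
    """R-square, adjusted R-square, and a least-squares form of AIC."""
    n, p = X.shape                       # p counts the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    aic = n * np.log(sse / n) + 2 * p    # assumed form, up to a constant
    return r2, adj_r2, aic

# Simulated data: y depends on x1 only; noise_x is unrelated by construction.
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
noise_x = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

small = np.column_stack([np.ones(n), x1])
big = np.column_stack([small, noise_x])

for name, X in [("x1 only", small), ("x1 + noise", big)]:
    r2, adj_r2, aic = fit_summary(X, y)
    print(f"{name:12s} R2={r2:.4f}  adj R2={adj_r2:.4f}  AIC={aic:.2f}")
```

R-square for the larger model is never lower than for the smaller one, whereas the adjusted R-square and AIC include a penalty for the extra coefficient and so can move in either direction.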
Other informal signs of multicollinearity are
The normality test will give an indication of whether the population from which the Y values were drawn appears to be normally distributed, but will not indicate the cause(s) of the nonnormality. The smaller the sample size, the less likely the normality test will be able to detect nonnormality.
If the residuals do not appear to be close to following a normal distribution, then transforming the Y variable may be a reasonable alternative.
A wedge-shaped fan pattern, like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance in the values increases in the direction the fan pattern widens (usually as the fitted value increases). This in turn suggests that a transformation of the Y values, or a weighted least squares linear regression, may be appropriate.
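As a sketch of the weighted least squares option, the example below simulates data whose spread grows with X and fits it with weights equal to 1/x squared. The weights are an assumption that matches how the data were simulated; with real data the appropriate weights have to be chosen or estimated.

```python
import numpy as np

# Simulated data: the standard deviation of Y is proportional to x,
# which produces a fan shape in the residual plot.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 40)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)
X = np.column_stack([np.ones_like(x), x])

# Assumed weights: inverse of the (relative) variance of each observation.
w = 1.0 / x**2

# Weighted least squares is ordinary least squares after multiplying each
# row of X and y by the square root of its weight.
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("OLS intercept, slope:", np.round(beta_ols, 3))
print("WLS intercept, slope:", np.round(beta_wls, 3))
```

The weighted fit gives the noisier observations at large X less say in determining the coefficients than the unweighted fit does.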
Outliers may appear as anomalous points in the graph (although an outlier may not be apparent in the residuals plot if it also has high leverage, drawing the fitted function toward it).
Other systematic patterns in the residuals (like a linear trend) suggest either that there is another X variable that should be considered in analyzing the data, or that a transformation of X or Y is needed.
A wedge-shaped fan pattern, like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance in the values increases in the direction the fan pattern widens (usually as the sample mean increases). This in turn suggests that a transformation of the X or Y values, or a weighted least squares linear regression, may be appropriate.
Points that are far from the others may be outliers in the data, or may suggest a nonnormal population distribution for Y. If an outlier is a high-leverage point, it may pull the fitted function toward it and perhaps away from the main body of the data, and may not appear as an outlier in the plot of residuals against X. Alternatively, a high-leverage point may make other points appear to be outliers by drawing the fitted function toward itself.
Systematic departures from the fitted function (e.g., all the points that are high or low in X have positive residuals while the points with middling values of X have negative residuals) may indicate that a transformation of X, a different linear model, or a nonlinear model may result in a better fit.
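The pattern described above, with positive residuals at both ends of X and negative residuals in the middle, is what a straight-line fit to curved data produces. The sketch below uses made-up, noise-free data (an assumption for clarity) and shows that adding an X-squared term, which is still a linear model in the coefficients, removes the pattern.

```python
import numpy as np

# Hypothetical curved data with no noise: y = 1 + 0.5 * x^2.
x = np.arange(-3.0, 4.0)
y = 1.0 + 0.5 * x**2

X_lin = np.column_stack([np.ones_like(x), x])    # straight line
X_quad = np.column_stack([X_lin, x**2])          # adds a squared term

b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
b_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

res_lin = y - X_lin @ b_lin
res_quad = y - X_quad @ b_quad

print("straight-line residuals:", np.round(res_lin, 2))
print("quadratic residuals:    ", np.round(res_quad, 2))
```

The straight-line residuals are positive at the extremes of X and negative in the middle, while the residuals from the model with the squared term are essentially zero.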
©1997 BBN Corporation All rights reserved.