If an implicit X variable is not included in the fitted model, the fitted estimates of the slope and intercept may be biased and not very meaningful, and the fitted Y values may not be accurate.
Another possible cause of apparent dependence between the Y observations is the presence of an implicit block effect. (The block effect can be considered another type of implicit X variable, albeit a discrete one.) If a blocking variable is suspected, an analysis of covariance can be performed, essentially dividing the data into different regression lines based on the value of the blocking variable. If the analysis of covariance shows a significant difference between the slopes in the regression lines, there is evidence that the linear relationship between X and Y varies with the value of the blocking factor.
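As a rough illustration of dividing the data into separate regression lines by block, the following Python sketch fits one least squares line per block and a pooled line that ignores the blocks. The data and the block labels "A" and "B" are invented for the example, and the helper name fit_line is an assumption, not part of any package. The sketch only compares fitted slopes; a formal analysis of covariance would also test whether the difference is significant.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: (x, y, block) triples; the block labels are made up.
data = [
    (1, 3.0, "A"), (2, 5.0, "A"), (3, 7.0, "A"), (4, 9.0, "A"),
    (1, 5.5, "B"), (2, 6.0, "B"), (3, 6.5, "B"), (4, 7.0, "B"),
]

# Split the observations by block and fit one line per block.
by_block = {}
for x, y, block in data:
    xs, ys = by_block.setdefault(block, ([], []))
    xs.append(x)
    ys.append(y)

block_fits = {block: fit_line(xs, ys) for block, (xs, ys) in by_block.items()}

# Pooled fit that ignores the blocking variable, for comparison.
pooled_fit = fit_line([x for x, _, _ in data], [y for _, y, _ in data])

for block, (slope, intercept) in sorted(block_fits.items()):
    print(f"block {block}: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"pooled:  slope={pooled_fit[0]:.3f}, intercept={pooled_fit[1]:.3f}")
```

In this made-up data set the two blocks have clearly different slopes, and the pooled line describes neither block well.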
If multiple values of Y are collected at the same values of X, this can act as another type of blocking, with the unique values of X acting as blocks. These multiple Y measurements may be less variable than the overall variation in Y, and, given their common value of X, they are not truly independent of each other. If there are many replicated X values, and if the variation between Y at replicated values is much smaller than the overall residual variance, then the estimated variance of the slope may be too small. This makes the test of whether the slope is 0 (and, equivalently, the test of the goodness of linear fit) anticonservative, that is, more likely than the stated significance level to reject the null hypothesis even when it is true. In this case, an alternative method is to replace each replicated X value by a single data point with the average Y value, and then perform the regression analysis with the new data set. A possible drawback to this method is that by reducing the number of data points, the degrees of freedom associated with the residual error are reduced, thus potentially reducing the power of the test.
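The averaging approach can be sketched in Python as follows. The data values and the helper names average_replicates and fit_line are hypothetical, chosen only to illustrate the idea under the assumption that replicates share exactly the same X value.

```python
from collections import defaultdict
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def average_replicates(xs, ys):
    """Collapse replicated X values to one point each, using the mean Y."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    unique_xs = sorted(groups)
    return unique_xs, [mean(groups[x]) for x in unique_xs]

# Hypothetical data: three Y measurements at each of four X values.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
ys = [2.1, 2.0, 2.2, 3.9, 4.1, 4.0, 6.2, 6.0, 6.1, 7.9, 8.1, 8.0]

x_avg, y_avg = average_replicates(xs, ys)
slope, intercept = fit_line(x_avg, y_avg)

# Residual degrees of freedom (n - 2) before and after averaging.
df_before = len(xs) - 2
df_after = len(x_avg) - 2
print(slope, intercept, df_before, df_after)
```

Here the residual degrees of freedom fall from 10 to 2, which shows how much the averaging can cost in power when there are few distinct X values.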
If you are unsure whether your Y values are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.
Once the regression line has been fitted, the boxplot and normal probability plot (normal Q-Q plot) for residuals may suggest the presence of outliers in the data. After the fit, outliers are usually detected by examining the residuals or the high-leverage points.
The method of least squares involves minimizing the sum of the squared vertical distances between each data point and the fitted line. Because of this, the fitted line can be highly sensitive to outliers. (In other words, least squares regression is not resistant to outliers, and thus, neither is the fitted slope estimate.) A point vertically removed from the other points can cause the fitted line to pass close to it, instead of following the general linear trend of the rest of the data, especially if the point is relatively far horizontally from the centroid of the data (the point represented by the mean of X and the mean of Y). Such points are said to have high leverage: the centroid acts as a fulcrum, and the fitted line pivots toward high-leverage points, perhaps fitting the main body of the data poorly. A data point that is extreme in Y but lies near the center of the data horizontally will not have much effect on the fitted slope, but by changing the estimate of the mean of Y, it may affect the fitted estimate of the intercept. A nonparametric or other alternative regression method may be a better method in such a situation. If you find outliers in your data that are not due to correctable errors, you may wish to consult a statistician as to how to proceed.
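To make the idea of leverage concrete, the sketch below computes the usual leverage value for simple linear regression, 1/n + (x - mean of X)^2 / Sxx, for each point, and compares the fitted slope with and without the most extreme point. The data are invented, and the function names are assumptions made for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def leverages(xs):
    """Leverage of each point: 1/n + (x - x_bar)^2 / Sxx."""
    n = len(xs)
    x_bar = mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

# Hypothetical data: five points on the line y = x, plus one point that is
# far from the centroid horizontally and far from the trend vertically.
xs = [1, 2, 3, 4, 5, 20]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]

h = leverages(xs)
slope_all, _ = fit_line(xs, ys)
slope_without, _ = fit_line(xs[:-1], ys[:-1])

print("leverages:", [round(v, 3) for v in h])
print("slope with the extreme point:   ", round(slope_all, 3))
print("slope without the extreme point:", round(slope_without, 3))
```

In this example the single extreme point has leverage close to 1 and is enough to change the sign of the fitted slope.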
For data from a normal distribution, normal probability plots should approximate straight lines, and boxplots should be symmetric (median and mean together, in the middle of the box) with no outliers. Except for substantial nonnormality that leads to outliers in the X-Y data, if the number of data points is not too small, then the linear regression statistic will not be much affected even if the population distributions are skewed. Unless the sample size is small (fewer than 10), light-tailedness or heavy-tailedness will have little effect on the linear regression.
Robust statistical tests operate well across a wide variety of distributions. A test can be robust for validity, meaning that it provides P values close to the true ones in the presence of (slight) departures from its assumptions. It may also be robust for efficiency, meaning that it maintains its statistical power (the probability that a true violation of the null hypothesis will be detected by the test) in the presence of those departures. Linear regression is fairly robust for validity against nonnormality. It is the most powerful test available when its test assumptions are met, but it may not be the most powerful test available for a given nonnormal distribution. In the case of nonnormality, using a nonparametric regression method or employing a transformation of X may result in a more powerful test.
Unless the heteroscedasticity of the Y is pronounced, its effect will not be severe: the least squares estimates will still be unbiased, and the estimates of the slope and intercept will either be normally distributed if the errors are normally distributed, or at least normally distributed asymptotically (as the number of data points becomes large) if the errors are not normally distributed. The estimates of the variances of the slope and intercept will be inaccurate, but the inaccuracy is not likely to be substantial if the X values are symmetric about their mean.
Heteroscedasticity of Y is usually detected informally by examining the X-Y scatterplot of the data before performing the regression. If both nonlinearity and unequal variances are present, employing a transformation of Y may have the effect of simultaneously improving the linearity and promoting equality of the variances. Otherwise, a weighted least squares linear regression may be the preferred method of dealing with nonconstant variance of Y.
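A weighted least squares fit can be sketched as below. The weights are an assumption made purely for illustration: the example supposes that the standard deviation of Y grows in proportion to X, so each point is weighted by 1/X^2. In practice the weights should come from what is known about the variances. The data and the function name fit_weighted_line are hypothetical.

```python
def fit_weighted_line(xs, ys, weights):
    """Weighted least squares fit of y = intercept + slope * x.

    Minimizes sum(w * (y - intercept - slope * x) ** 2) over the data.
    """
    w_total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data whose scatter widens as X increases.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.3, 7.4, 11.0, 10.5]

# Assumed variance model (illustration only): sd(Y) proportional to X,
# so each point gets weight 1 / x^2.
weights = [1 / x ** 2 for x in xs]

wls_slope, wls_intercept = fit_weighted_line(xs, ys, weights)

# Equal weights reproduce the ordinary least squares fit.
ols_slope, ols_intercept = fit_weighted_line(xs, ys, [1.0] * len(xs))

print("weighted:  ", round(wls_slope, 4), round(wls_intercept, 4))
print("unweighted:", round(ols_slope, 4), round(ols_intercept, 4))
```

The weighted fit pays more attention to the points with small X, where the assumed variance is smallest, so its estimates differ from the unweighted ones.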
If the linear model is not correct, the shape of the general trend of the X-Y plot may suggest the appropriate function to fit (e.g., a polynomial, exponential, or logistic function). Alternatively, the plot may suggest a reasonable transformation to apply. For example, if the X-Y plot arcs from lower left to upper right so that data points either very low or very high in X lie below the straight line suggested by the data, while the data points with middling X values lie on or above that straight line, taking square roots or logarithms of the X values may promote linearity.
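The effect of such a transformation can be checked numerically. In the sketch below, the invented Y values follow a logarithmic curve in X exactly, so the plot of Y against X arcs in the way just described. The squared correlation (R squared) is computed for Y against X and for Y against log X. The data and the function name r_squared are assumptions for this example.

```python
import math
from statistics import mean

def r_squared(xs, ys):
    """Squared correlation between xs and ys (R squared of the linear fit)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: Y rises steeply at low X and flattens at high X.
xs = [1, 2, 4, 8, 16, 32, 64]
ys = [3 + 2 * math.log(x) for x in xs]

r2_raw = r_squared(xs, ys)                          # Y against X
r2_log = r_squared([math.log(x) for x in xs], ys)   # Y against log X

print("R squared, Y vs X:    ", round(r2_raw, 4))
print("R squared, Y vs log X:", round(r2_log, 4))
```

Because the data were constructed from a logarithmic curve, the transformed fit is essentially perfect here; with real data the improvement would be smaller and should be judged from the plot as well as from R squared.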
If the assumption of equal variances for the Y is correct, the plot of the observed Y values against X should suggest a band across the graph with roughly equal vertical width for all values of X. (That is, the shape of the graph should suggest a tilted cigar and not a wedge or a megaphone.)
A fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance of the Y values increases in the direction in which the fan pattern widens (usually as the sample mean increases). This in turn suggests that a transformation of the Y values may be needed.
Even if none of the test assumptions are violated, a linear regression on a small number of data points may not have sufficient power to detect a significant difference between the slope and 0, even if the slope is non-zero. The power depends on the residual error, the observed variation in X, the selected significance (alpha-) level of the test, and the number of data points. Power decreases as the residual variance increases, decreases as the significance level is decreased (i.e., as the test is made more stringent), increases as the variation in observed X increases, and increases as the number of data points increases. If a statistical significance test with a small number of data values produces a surprisingly non-significant P value, then lack of power may be the reason. The best time to avoid such problems is in the design stage of an experiment, when appropriate minimum sample sizes can be determined, perhaps in consultation with a statistician, before data collection begins.
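The dependence of power on the number of data points can be explored by simulation. The sketch below estimates the power of the two-sided 5% t test of slope = 0 by repeatedly generating data with a known nonzero slope and normally distributed errors, then counting how often the test rejects. The true slope, error standard deviation, X values, number of simulations, and function names are all assumptions chosen for illustration. The critical values are the usual tabled values of Student's t for 8 and 28 degrees of freedom.

```python
import random
from statistics import mean

# Two-sided 5% critical values of Student's t from standard tables,
# keyed by residual degrees of freedom (n - 2).
T_CRIT_05 = {8: 2.306, 28: 2.048}

def slope_t_statistic(xs, ys):
    """t statistic for testing slope = 0 in simple linear regression."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_error = (rss / (n - 2) / sxx) ** 0.5
    return slope / std_error

def estimate_power(xs, true_slope, sigma, n_sims=2000, seed=0):
    """Fraction of simulated data sets in which slope = 0 is rejected."""
    rng = random.Random(seed)
    t_crit = T_CRIT_05[len(xs) - 2]
    rejections = 0
    for _ in range(n_sims):
        ys = [true_slope * x + rng.gauss(0.0, sigma) for x in xs]
        if abs(slope_t_statistic(xs, ys)) > t_crit:
            rejections += 1
    return rejections / n_sims

xs_small = list(range(1, 11))   # 10 data points
xs_large = xs_small * 3         # 30 data points over the same X range

power_small = estimate_power(xs_small, true_slope=0.2, sigma=1.0)
power_large = estimate_power(xs_large, true_slope=0.2, sigma=1.0)
print("estimated power, n = 10:", power_small)
print("estimated power, n = 30:", power_large)
```

Changing sigma, true_slope, or the spread of the X values in this sketch shows the other dependencies described above. A simulation of this kind is only a planning aid and does not replace a proper sample size calculation.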
In general, unless there is a structural or theoretical reason to assume that the intercept is 0, it's preferable to fit both the slope and intercept.
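The consequence of forcing the intercept to be 0 when it is not can be seen in a small example. In the sketch below, the invented data lie exactly on the line Y = 5 + X. The fit through the origin uses the least squares slope sum(XY) / sum(X^2). The function names are assumptions for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_through_origin(xs, ys):
    """Least squares slope for the model y = slope * x (intercept fixed at 0)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical data lying exactly on y = 5 + x.
xs = [1, 2, 3, 4, 5]
ys = [6.0, 7.0, 8.0, 9.0, 10.0]

slope, intercept = fit_line(xs, ys)
origin_slope = fit_through_origin(xs, ys)

print("slope and intercept fitted:", slope, intercept)
print("slope with intercept forced to 0:", round(origin_slope, 4))
```

With the intercept forced to 0, the fitted slope is more than twice the true slope, because the line has to absorb the missing intercept.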
©1996 BBN Corporation All rights reserved.