John Whitehead
Department of Economics
Appalachian State University
whiteheadjc@appstate.edu
Why should a researcher use multiple, or multivariate, regression instead of bivariate (simple) regression?
"Oftentimes, two or more variables have separate effects that cannot be isolated. For example, if one group of professors use textbook A and another group of professors use textbook B, and students in the classes were given tests to see how much they had learned, the independent variables (the textbooks and professors' teaching effectiveness) would be confounded. It would be difficult to tell whether differences in test scores were caused by either or both independent variables if bivariate regression were used." (Source: Dictionary of Statistics and Methodology: A Nontechnical Guide for the Social Sciences, by W. Paul Vogt, Sage, 1993.)
Multiple regression allows the researcher to tell whether differences were caused by either or both variables by holding constant the confounding variable when analyzing the other variable.
Instead of an abstract model
EXPEND = f(INCOME, AGE)
where EXPEND (vacation expenditures) increases with INCOME (income in thousands) and decreases with AGE (the age of the tourist), we get a more descriptive picture of reality, such as:
EXPEND = 100 + 30 × INCOME - 10 × AGE
where we now know that for every unit that INCOME increases, EXPEND increases by $30 and for every unit that AGE increases, EXPEND decreases by $10.
Given test statistics on the numbers above, we can determine if these are "statistically significant." Statistical significance indicates the confidence we can place in the quantitative regression results. For example, it is important to know whether there is a 5% or a 50% chance that the true effect of INCOME on EXPEND is zero.
Suppose we want to predict what will happen to EXPEND if INCOME increases by 10%. If average income is $30 (thousand), a 10% increase is $3 (thousand). Plugging the $3 change into the model, only the INCOME term changes:

change in EXPEND = 30 × 3 = 90

so we predict that EXPEND will increase by $90 if INCOME increases by $3000, holding the age of the tourist constant.
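The arithmetic can be checked with a short script; the coefficients are the illustrative ones from the equation above, not estimates from data:

```python
def predict_expend(income, age):
    """EXPEND = 100 + 30*INCOME - 10*AGE (the illustrative equation above)."""
    return 100 + 30 * income - 10 * age

# Hold AGE constant (say, 40) and raise INCOME from $30 to $33 thousand:
change = predict_expend(33, 40) - predict_expend(30, 40)
print(change)  # 90
```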
These uses do not differ from those of simple regression, but the results are less likely to be misleading due to confounding effects.
The multiple regression model contains a dependent variable (Y), two or more independent variables (X_{1}, X_{2}, ...), and an error term (e), where
v Y depends on the Xs
v Y and the Xs are continuous variables
The linear regression model is of the form
Y = B_{0} + B_{1}X_{1} + B_{2}X_{2} + e
B_{0} is the intercept coefficient (multiplied by the constant, 1), and B_{1} and B_{2} are the slope coefficients: the rate of change in Y for each one-unit change in X_{1} and X_{2}.
The error term is added to the model to capture all the variation in the dependent variable that cannot be explained by the independent variables. It is the difference between the observed Y and the true regression line.
Also, qualitative independent variables (i.e. 0,1 dummies) can be easily accommodated in linear regression. Suppose that a dummy variable, D, is equal to one if the subject of the analysis is a male and zero if the subject is a female. The model becomes:
Y = B_{0} + B_{1}X_{1} + B_{2}X_{2} + B_{3}D + e
and the intercept is interpreted as:

B_{0} + B_{3}, if male
B_{0}, if female
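A small sketch shows how the dummy shifts the intercept; the coefficient values here are made up purely for illustration:

```python
def predict(x1, x2, d, b=(5.0, 2.0, -1.0, 3.0)):
    """Y = B0 + B1*X1 + B2*X2 + B3*D with a 0/1 dummy D.
    The coefficients b are purely illustrative, not estimates."""
    b0, b1, b2, b3 = b
    return b0 + b1 * x1 + b2 * x2 + b3 * d

# At X1 = X2 = 0 the prediction is just the intercept:
print(predict(0, 0, 1))  # male:   B0 + B3 = 8.0
print(predict(0, 0, 0))  # female: B0      = 5.0
```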
The estimated regression equation is:
Y = ß_{0} + ß_{1}X_{1} + ß_{2}X_{2} + ß_{3}D + ê
where the ßs are the OLS estimates of the Bs. OLS minimizes the sum of the squared residuals
OLS minimizes SUM ê^{2}
The residual, ê, is the difference between the actual Y and the predicted Y and has a zero mean. In other words, OLS calculates the slope coefficients so that the difference between the predicted Y and the actual Y is minimized. (The residuals are squared so that negative and positive errors do not cancel each other out.)
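A minimal sketch of what OLS does under the hood, assuming a made-up dataset generated exactly from the earlier EXPEND equation. Rather than searching over residuals directly, the sketch solves the normal equations, which is how the minimization works out algebraically:

```python
def ols(xs, y):
    """OLS for Y = B0 + B1*X1 + B2*X2: solve the normal equations
    (X'X) b = X'y by Gauss-Jordan elimination. xs is a list of (x1, x2) rows."""
    X = [[1.0, r[0], r[1]] for r in xs]  # prepend the constant column
    k = 3
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
    A = [row[:] + [v] for row, v in zip(XtX, Xty)]  # augmented matrix
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Data generated exactly from EXPEND = 100 + 30*INCOME - 10*AGE,
# so OLS recovers the coefficients (all residuals are zero):
xs = [(20, 30), (25, 45), (40, 50), (35, 25), (50, 60)]
y = [100 + 30 * x1 - 10 * x2 for x1, x2 in xs]
print([round(b, 6) for b in ols(xs, y)])  # [100.0, 30.0, -10.0]
```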
The OLS estimates of the ßs:
v are unbiased - the ßs are centered around the true population values of the Bs
v have minimum variance - the distributions of the ß estimates around the true Bs are as tight as possible
v are consistent - as the sample size (n) approaches infinity, the estimated ßs converge on the true Bs
v are normally distributed - statistical tests based on the normal distribution can be applied to these estimates.
Statistical computing packages such as SPSS routinely print out the estimated ßs when estimating a regression equation (i.e. ols1.txt).
We hope that our regression model will explain the variation in the dependent variable fairly accurately. If it does, we say that "the model fits the data well." Evaluating the overall fit of the model also helps us to compare models that differ in data set, composition and number of independent variables, etc.
There are three primary statistics for evaluating overall fit:
The coefficient of determination, R^{2}, is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):
R^{2} = ESS/TSS = SUM([Y - ê] - µ_{Y})^{2} / SUM(Y - µ_{Y})^{2}
where ESS is the summation of the squared values of the difference between the predicted Ys (Y - ê) and the mean of Y (µ_{Y}, a naive estimate of Y) and TSS is the summation of the squared values of the difference between the actual Ys and the mean of Y.
The R^{2} ranges from 0 to 1 and can be interpreted as the percentage of the variance in the dependent variable that is explained by the independent variables.
Adding a variable to a multiple regression equation virtually guarantees that the R^{2} will increase (even if the variable is not very meaningful). The adjusted R^{2} statistic is the same as the R^{2} except that it takes into account the number of independent variables (k). The adjusted R^{2} will increase, decrease or stay the same when a variable is added to an equation depending on whether the improvement in fit (ESS) outweighs the loss of a degree of freedom (n-k-1):
adjusted R^{2} = 1 - (1 - R^{2}) × [(n - 1)/(n - k - 1)]
The adjusted R^{2} is most useful when comparing regression models with different numbers of independent variables.
The F statistic is the ratio of the explained to the unexplained portions of the total sum of squares (RSS=SUM ê^{2}), adjusted for the number of independent variables (k) and the degrees of freedom (n-k-1):
F = [ESS/k] / [RSS/(n - k - 1)]
The F statistic allows the researcher to test whether the model as a whole is statistically significant, i.e. whether the slope coefficients are jointly different from zero.
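All three fit statistics can be computed directly from the actual and predicted Ys; a minimal sketch with made-up numbers:

```python
def fit_stats(y, yhat, k):
    """Overall-fit statistics from actual (y) and predicted (yhat) values;
    k is the number of independent variables, excluding the constant."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)             # total sum of squares
    ess = sum((fi - mean_y) ** 2 for fi in yhat)          # explained sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # residual: SUM e^2
    r2 = ess / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f = (ess / k) / (rss / (n - k - 1))
    return r2, adj_r2, f

# Invented data from a one-variable model (k = 1):
y    = [1.0, 2.0, 3.0, 4.0]
yhat = [1.5, 1.5, 3.5, 3.5]
r2, adj_r2, f = fit_stats(y, yhat, k=1)
print(round(r2, 3), round(adj_r2, 3), round(f, 3))  # 0.8 0.7 8.0
```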
Statistical computing packages such as SPSS routinely print out these statistics (i.e. ols2.txt).
What is a 'good' overall fit? It depends. Cross-sectional data will often produce R^{2}s that seem quite low; R^{2}=.07 might be good for some types of data while for others it might be very, very bad. The adjusted R^{2}, F-stat, and hypothesis tests of independent variables are all important determinants of model fit.
Because most data sets are samples from a population, we must ask whether our ßs reflect real relationships or merely sampling variation when explaining variation in the dependent variable.
The null hypothesis states that X is not associated with Y, therefore the ß is equal to zero; the alternative hypothesis states that X is associated with Y, therefore the ß is not equal to zero.
The t-statistic is equal to the ß divided by the standard error of ß (s.e., a measure of the dispersion of the ß)
t = ß/s.e.
A (very) rough guide to testing hypotheses might be: "t-statistics above 2 are good." Also check your t-tables and significance (confidence) levels.
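A quick numeric illustration; the coefficient and standard error are invented, and the p-value uses the normal approximation to the t distribution rather than the t table:

```python
import math

def t_stat(beta, se):
    """t statistic for the null hypothesis that B = 0."""
    return beta / se

def approx_two_sided_p(t):
    """Two-sided p-value using the normal approximation to the
    t distribution (reasonable when n is large)."""
    return math.erfc(abs(t) / math.sqrt(2))

t = t_stat(30.0, 12.0)  # e.g., a coefficient of 30 with a standard error of 12
print(round(t, 2))      # 2.5 -- above the rough "t above 2" guide
print(round(approx_two_sided_p(t), 4))  # about 0.0124
```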
Statistical computing packages such as SPSS routinely print out the standard errors, t-stats, and significance levels when estimating a regression equation (i.e. ols3.txt).
How do you choose which variables to include in your model?
Omitted variable
  Detection: On the basis of theory, significant unexpected signs, or poor model fit
  Consequences: The estimated coefficients are biased and inconsistent
  Correction: Include the left-out variable or a proxy

Irrelevant variable
  Detection: Theory; t-test; effect on the other coefficients and the adjusted R^{2} if the irrelevant variable is dropped
  Consequences: Lower adjusted R^{2}, higher s.e.s, and lower t-stats
  Correction: Delete the variable from the model if it is not required by theory

Functional form
  Detection: Reconsider the underlying relationship between Y and the Xs
  Consequences: Biased and inconsistent coefficients, poor overall fit
  Correction: Transform the variable or the equation to a different functional form
Specification searches are sometimes called "data mining" (see specbias.txt for an example).
There are several assumptions which must be met for the OLS estimates to be unbiased and have minimum variance. Two of the most easily violated assumptions are:
v No explanatory variable is a perfect linear function of other explanatory variables, or no perfect multicollinearity.
v The error term has a constant variance; a nonconstant error variance is called heteroskedasticity.
Multicollinearity
  Detection: Check to see if the adjusted R^{2} is high while the t-stats are low; check to see if the correlation coefficients between the Xs are high
  Consequences: The estimated coefficients are not biased but the t-stats will fall
  Correction: Drop one of the problematic variables, or combine the problematic variables as interactions

Heteroskedasticity
  Detection: Plot the residuals against the Xs and look for spread or contraction; use some standard tests
  Consequences: The estimated coefficients are not biased but the t-stats will be misleading
  Correction: Redefine the variables (i.e. in % terms) or weight the data
Other important assumptions which may be violated with certain types of data are:
v All explanatory variables are uncorrelated with the error term. Correlation leads to endogeneity bias, a problem with simultaneous-equations data.
v The error term from one observation is independent of the error terms from other observations. Dependence leads to autocorrelation, a problem with time-series data (i.e. this year's error term depends on last year's error term).
Violations of these assumptions are less likely to occur with many types of data so we'll leave their discussion to the extensions section.
v Interaction effects
Interaction terms are combinations of independent variables. For example, if you think that men earn more per year of work experience than women do, then include an interaction term [multiply the male dummy (D) by the independent variable] along with the experience variable and male dummy.
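Constructing the interaction column is just element-wise multiplication; a sketch with invented data:

```python
# Rows of data: [experience, male dummy]; the interaction term is their product.
rows = [[5, 1], [3, 0], [10, 1], [7, 0]]
with_interaction = [[x, d, x * d] for x, d in rows]
print(with_interaction)  # [[5, 1, 5], [3, 0, 0], [10, 1, 10], [7, 0, 0]]
```

The interaction column equals experience for men and zero for women, so its coefficient measures the extra return to experience for men.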
v Limited and qualitative dependent variables
There are many important research topics for which the dependent variable is qualitative. Researchers often want to predict whether something will happen or not, such as referendum votes, business failure, disease--anything that can be expressed as Event/Nonevent or Yes/No. Logistic regression is a type of regression analysis where the dependent variable is dichotomous and coded 0, 1.
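A sketch of the logistic functional form with illustrative (not estimated) coefficients; the linear index is squashed into a probability between 0 and 1:

```python
import math

def logit_prob(x1, x2, b=(-2.0, 0.5, 1.0)):
    """Predicted probability that Y = 1 in a logistic regression.
    The coefficients b are illustrative, not estimated from data."""
    b0, b1, b2 = b
    z = b0 + b1 * x1 + b2 * x2     # the linear index, as in OLS
    return 1 / (1 + math.exp(-z))  # squashed into the (0, 1) range

print(logit_prob(4, 0))  # z = 0, so the predicted probability is 0.5
```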
v Two-stage least squares
This is one solution to endogeneity bias. If an independent variable is correlated with the error term (i.e. in a model of the number of years education chosen, the number of children might be chosen at the same time). Two-stage least squares would first predict the number of children (based on other independent variables) and then use the prediction as the independent variable in the education model.
v Time series analysis, and forecasting
If you have yearly, quarterly, or monthly data then the ordering of the observations matters (with cross-section data it doesn't matter if Sally comes before or after Jane). For example, regression models of monthly car sales might include monthly and lagged monthly advertisements. Some standard time-series models should be used to account for the correlation of the lagged advertisements.
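Building a lagged regressor is a matter of shifting the series; a sketch with invented monthly data:

```python
# Monthly data: this month's sales regressed on current and last month's ads.
sales = [10, 12, 9, 14, 11]
ads   = [3, 4, 2, 5, 3]
# Constructing the lagged regressor costs the first observation:
rows = [(sales[t], ads[t], ads[t - 1]) for t in range(1, len(sales))]
print(rows)  # [(12, 4, 3), (9, 2, 4), (14, 5, 2), (11, 3, 5)]
```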
These notes are based on: Using Econometrics: A Practical Guide, by A.H. Studenmund, 3rd Edition, 1997. See also: Applied Regression: An Introduction, by Michael S. Lewis-Beck, Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-022.