An Introduction to Multiple Regression

John Whitehead
Department of Economics
Appalachian State University

Outline

Purpose | Uses | The Model | OLS | Overall Fit | Hypothesis Testing | Problems | Extensions

Purpose

Why should a researcher use multiple, or multivariate, regression instead of bivariate (simple) regression?

"Oftentimes, two or more variables have separate effects that cannot be isolated. For example, if one group of professors use textbook A and another group of professors use textbook B, and students in the classes were given tests to see how much they had learned, the independent variables (the textbooks and professors' teaching effectiveness) would be confounded. It would be difficult to tell whether differences in test scores were caused by either or both independent variables if bivariate regression were used." (Source: Dictionary of Statistics and Methodology: A Nontechnical Guide for the Social Sciences, by W. Paul Vogt, Sage, 1993.)

Multiple regression allows the researcher to tell whether differences were caused by either or both variables by holding constant the confounding variable when analyzing the other variable.

Multiple regression has three major uses

1. A description or model of reality

Instead of an abstract model

EXPEND = f(INCOME, AGE)

where EXPEND (vacation expenditures) increases with INCOME (income in thousands) and decreases with AGE (the age of the tourist), we get a more descriptive picture of reality, such as:

EXPEND = 100 + 30 × INCOME - 10 × AGE

where we now know that for every unit (each additional $1,000) that INCOME increases, EXPEND increases by $30, and for every unit (each additional year) that AGE increases, EXPEND decreases by $10, in each case holding the other variable constant.

2. The testing of hypotheses about theory

Given test statistics on the numbers above, we can determine if these are "statistically significant." Statistical significance indicates the confidence we can place in the quantitative regression results. For example, it is important to know whether an estimated effect as large as the one observed would arise 5% or 50% of the time by chance alone if the true effect of INCOME on EXPEND were zero.

3. Predictions about the future

Suppose we want to predict what will happen to EXPEND if INCOME increases by 10%. If average income is $30,000 (INCOME = 30), a 10% increase is 3 units of INCOME. Multiply the INCOME coefficient by that change:

change in EXPEND = 30 × 3 = 90

and predict that EXPEND will increase by $90 if INCOME increases by $3,000, holding the age of the tourist constant.
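
The arithmetic of this prediction can be checked with a few lines of Python. This is only a sketch: the coefficients are the illustrative ones from the equation above, while the function name `predict_expend` and the tourist's age of 40 are assumptions made for the example.

```python
def predict_expend(income, age):
    """Illustrative fitted equation: income in thousands of dollars, age in years."""
    return 100 + 30 * income - 10 * age

# Hypothetical tourist: age held constant at 40, income rising from 30 to 33
# (a 10% increase on an average income of $30,000).
before = predict_expend(30, 40)
after = predict_expend(33, 40)
change = after - before
print(before, after, change)
```

Because AGE is the same in both calls, its term cancels and the change depends only on the INCOME coefficient.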

These uses are the same as for simple regression, but the results are less likely to be misleading because confounding effects are held constant.

The (Theoretical) Regression Model

The multiple regression model contains a dependent variable (Y), (more than one) independent variables (X₁, X₂), and the error term (e), where

• Y depends on the Xs

• Y and the Xs are continuous variables

The linear regression model is of the form

Y = B₀ + B₁X₁ + B₂X₂ + e

B₀ is the intercept coefficient (multiplied by the constant, 1), and the B₁ and B₂ coefficients are the slopes: the rate of change in Y for each unit change in X₁ or X₂, holding the other X constant.

The error term is added to the model to capture all the variation in the dependent variable that cannot be explained by the independent variables. It is the difference between the observed Y and the true regression equation.

Also, qualitative independent variables (e.g., 0,1 dummies) can be easily accommodated in linear regression. Suppose that a dummy variable, D, is equal to one if the subject of the analysis is a male and zero if the subject is a female. The model becomes:

Y = B₀ + B₁X₁ + B₂X₂ + B₃D + e

and the intercept is interpreted as equal to:

B₀ + B₃, if male

B₀, if female
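
A small sketch of how the dummy shifts the intercept. The coefficient values and the function name `intercept` are hypothetical and chosen only for illustration; they do not come from any estimated model.

```python
def intercept(b0, b3, male):
    """Intercept of the regression line for one group; `male` is the 0/1 dummy D."""
    return b0 + b3 * male

# Hypothetical coefficient values, for illustration only.
b0, b3 = 5.0, 2.0
male_intercept = intercept(b0, b3, 1)    # B0 + B3
female_intercept = intercept(b0, b3, 0)  # B0
print(male_intercept, female_intercept)
```

The slopes on X₁ and X₂ are the same for both groups; only the level of the line differs.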

Ordinary Least Squares (OLS)

The estimated regression equation is:

Y = b₀ + b₁X₁ + b₂X₂ + b₃D + ê

where the bs are the OLS estimates of the Bs and ê is the residual. OLS chooses the bs to minimize the sum of the squared residuals:

OLS minimizes SUM ê²

The residual, ê, is the difference between the actual Y and the predicted Y and (when the model includes an intercept) has a zero mean. In other words, OLS calculates the intercept and slope coefficients so that the squared differences between the predicted Y and the actual Y are, in total, as small as possible. (The residuals are squared so that negative and positive errors do not cancel each other out.)
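
As a rough illustration of the minimization described above, the following sketch fits a regression by least squares with NumPy. The simulated data, the "true" coefficients (1, 2, -3), and the variable names are assumptions invented for the example; `numpy.linalg.lstsq` is used here as one convenient least-squares solver, not as the method any particular statistics package uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Made-up "true" model used only to generate example data.
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# Design matrix: a column of ones (the constant), then X1 and X2.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

print(coef)              # estimates of the intercept and two slopes
print(residuals.mean())  # essentially zero because a constant is included
```

Any other choice of coefficients gives a sum of squared residuals at least as large as the one from `coef`.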

When the classical assumptions hold, the OLS estimates (the bs):

• are unbiased - the bs are centered around the true population values of the Bs

• have minimum variance - the distributions of the b estimates around the true Bs are as tight as possible

• are consistent - as the sample size (n) approaches infinity, the estimated bs converge on the true Bs

• are normally distributed - statistical tests based on the normal distribution can be applied to these estimates.

Statistical computing packages such as SPSS routinely print out the estimated coefficients (the bs) when estimating a regression equation (e.g., ols1.txt).

Evaluating the overall performance of the model

We hope that our regression model will explain the variation in the dependent variable fairly accurately. If it does, we say that "the model fits the data well." Evaluating the overall fit also helps us to compare models that differ in their data sets, in the composition and number of their independent variables, etc.

There are three primary statistics for evaluating overall fit:

1. R²

The coefficient of determination, R², is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):

R² = ESS/TSS = SUM([Y - ê] - Ȳ)² / SUM(Y - Ȳ)²

where ESS is the summation of the squared values of the difference between the predicted Ys (Y - ê, the actual Y minus the residual) and the mean of Y (Ȳ, a naive estimate of Y) and TSS is the summation of the squared values of the difference between the actual Ys and the mean of Y.

The R² ranges from 0 to 1 and can be interpreted as the proportion of the variance in the dependent variable that is explained by the independent variables (multiply by 100 for a percentage).

2. Adjusted R²

Adding a variable to a multiple regression equation virtually guarantees that the R² will increase (even if the variable is not very meaningful). The adjusted R² statistic is the same as the R² except that it takes into account the number of independent variables (k). The adjusted R² will increase, decrease or stay the same when a variable is added to an equation depending on whether the improvement in fit (ESS) outweighs the loss of a degree of freedom (the degrees of freedom are n-k-1, where n is the number of observations):

adjusted R² = 1 - (1 - R²) × [(n - 1)/(n - k - 1)]

The adjusted R² is most useful when comparing regression models with different numbers of independent variables.

3. F-stat

The F statistic is the ratio of the explained (ESS) to the unexplained (the residual sum of squares, RSS = SUM ê²) portions of the total sum of squares, each adjusted for its degrees of freedom: ESS is divided by the number of independent variables (k) and RSS by the degrees of freedom (n-k-1):

F = [ESS/k] / [RSS/(n - k - 1)]

The F statistic allows the researcher to determine whether the model as a whole is statistically significant, that is, whether the slope coefficients are jointly different from zero.

Statistical computing packages such as SPSS routinely print out these statistics (e.g., ols2.txt).
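
The three statistics above can also be computed by hand from the sums of squares. The sketch below follows the formulas given in these notes; the function name `fit_statistics` and the simulated data are assumptions made for the example, and the output of a package such as SPSS may be laid out differently.

```python
import numpy as np

def fit_statistics(X, y):
    """Return R-squared, adjusted R-squared, and F for an OLS fit.

    X holds the k independent variables (without the constant column).
    """
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])          # add the constant
    coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    y_hat = Xc @ coef                              # predicted Y
    ess = np.sum((y_hat - y.mean()) ** 2)          # explained sum of squares
    tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
    rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
    r2 = ess / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = (ess / k) / (rss / (n - k - 1))
    return r2, adj_r2, f_stat

# Simulated example data: the second column of X is irrelevant by construction.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

r2, adj_r2, f_stat = fit_statistics(X, y)
print(r2, adj_r2, f_stat)
```

Whenever R² is below 1, the adjusted R² comes out smaller than R², since the factor (n - 1)/(n - k - 1) is greater than one.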

What is a 'good' overall fit? It depends. Cross-sectional data will often produce R²s that seem quite low; R²=.07 might be good for some types of data while for others it might be very, very bad. The adjusted R², F-stat, and hypothesis tests of independent variables are all important indicators of model fit.

Hypothesis Testing

Because most data sets are samples from a population, we worry whether our estimated coefficients (the bs) reflect real relationships or merely sampling variation.

The null hypothesis states that X is not associated with Y, so the true coefficient B is equal to zero; the alternative hypothesis states that X is associated with Y, so B is not equal to zero.

The t-statistic is equal to the estimate b divided by its standard error (s.e., a measure of the dispersion of b):

t = b/s.e.

A (very) rough guide to testing hypotheses might be: "t-statistics above 2 in absolute value are good." Also check your t-tables and significance (confidence) levels.

Statistical computing packages such as SPSS routinely print out the standard errors, t-stats, and significance levels (p-values: the probability of obtaining a t-statistic at least this large in absolute value if the true coefficient were zero) when estimating a regression equation (e.g., ols3.txt).
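
For readers who want to see where the t-statistics come from, here is a minimal sketch. It assumes the textbook formula for OLS standard errors (the square roots of the diagonal of s² × (X'X)⁻¹, where s² is the residual variance), which these notes do not derive; the function name `ols_t_statistics` and the simulated data are likewise assumptions made for the example.

```python
import numpy as np

def ols_t_statistics(X, y):
    """Return coefficients, standard errors, and t-statistics.

    X must already contain the constant column.
    """
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - p)          # residual variance, n-k-1 degrees of freedom
    cov = sigma2 * np.linalg.inv(X.T @ X)     # estimated covariance of the coefficients
    se = np.sqrt(np.diag(cov))
    return coef, se, coef / se

# Simulated data: x1 truly affects y, x2 does not.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, se, t = ols_t_statistics(X, y)
print(t)
```

With data generated this way, the t-statistic on x1 is far above 2, while the one on the irrelevant x2 will usually, though not always, fall below 2 in absolute value.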

Some problems and solutions

Specification Bias

How do you choose which variables to include in your model?

Problem	Detection	Consequences	Correction
Omitted variable	On the basis of theory, significant unexpected signs or poor model fit	The estimated coefficients are biased and inconsistent	Include the left out variable or a proxy
Irrelevant variable	Theory, t-test, effect on the other coefficients and adjusted R² if the irrelevant variable is dropped	Lower adjusted R², higher s.e.s, and lower t-stats	Delete the variable from the model if it is not required by theory
Functional form	Reconsider the underlying relationship between Y and the Xs	Biased and inconsistent coefficients, poor overall fit	Transform the variable or the equation to a different functional form

Specification searches are sometimes called "data mining" (see specbias.txt for an example).

Violation of Assumptions

There are several assumptions which must be met for the OLS estimates to be unbiased and have minimum variance. Two of the most easily violated assumptions are:

• No explanatory variable is a perfect linear function of other explanatory variables, or no perfect multicollinearity.

• The error term has a constant variance. A nonconstant variance is called heteroskedasticity.

Violation	Detection	Consequences	Correction
Multicollinearity (see multicol.txt for an example.)	Check to see if the adjusted R² is high while the t-stats are low; check to see if the correlation coefficients are high	The estimated coefficients are not biased but the t-stats will fall	Drop one of the problematic variables, combine problematic variables as interactions
Heteroskedasticity (see hetero.txt for the Park test.)	Plot the residuals against the Xs and look for spread or contraction; use some standard tests	The estimated coefficients are not biased but the t-stats will be misleading	Redefine the variables (i.e. in % terms) or weight the data
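
One of the "standard tests" for heteroskedasticity mentioned in the table can be sketched as follows. This assumes the common form of the Park test, in which the log of the squared residuals is regressed on the log of a suspect X and a clearly nonzero slope signals a variance that changes with X; the example in hetero.txt may differ in detail, and the simulated data below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
# Simulated model in which the spread of the error grows with x.
y = 2.0 + 0.5 * x + rng.normal(size=n) * x

# Step 1: ordinary regression of y on x, keeping the residuals.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Step 2: auxiliary regression of ln(residual squared) on ln(x).
Z = np.column_stack([np.ones(n), np.log(x)])
park_coef, *_ = np.linalg.lstsq(Z, np.log(resid ** 2), rcond=None)
park_slope = park_coef[1]
print(park_slope)
```

In practice the slope from the second step would be judged by its own t-statistic; here the data were built so that the error variance rises with x, so the slope comes out clearly positive.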

Other important assumptions which may be violated with certain types of data are:

• All explanatory variables are uncorrelated with the error term. Correlation would lead to simultaneous-equations (endogeneity) bias.

• The error term from one observation is independent of the error terms from other observations. Dependence would lead to autocorrelation: a problem with time-series data (e.g., this year's error term depends on last year's error term).

Violations of these assumptions are less likely to occur with many types of data so we'll leave their discussion to the extensions section.

Extensions of the Model

• Interaction effects

Interaction terms are products of independent variables. For example, if you think that men earn more per year of work experience than women do, then include an interaction term [multiply the male dummy (D) by the experience variable] along with the experience variable and the male dummy.
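
The earnings example can be sketched as follows. The data are hypothetical and noise-free (generated from made-up coefficients 20, 2, 5 and 1) so that the regression recovers them exactly; the variable names are assumptions for the example.

```python
import numpy as np

# Hypothetical, noise-free data: four men and four women with 1-4 years of experience.
experience = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
male = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
earnings = 20 + 2 * experience + 5 * male + 1 * male * experience

# Columns: constant, experience, male dummy, and the interaction (dummy times experience).
X = np.column_stack([np.ones(len(earnings)), experience, male, male * experience])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)

slope_female = coef[1]            # return to a year of experience for women
slope_male = coef[1] + coef[3]    # the interaction coefficient is the extra return for men
print(coef, slope_female, slope_male)
```

The coefficient on the interaction term is the difference between the two groups' slopes, just as the coefficient on the dummy alone is the difference between their intercepts.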

• Limited and qualitative dependent variables

There are many important research topics for which the dependent variable is qualitative. Researchers often want to predict whether something will happen or not, such as referendum votes, business failure, disease--anything that can be expressed as Event/Nonevent or Yes/No. Logistic regression is a type of regression analysis where the dependent variable is dichotomous and coded 0, 1.

• Two-stage least squares

This is one solution to endogeneity bias, which arises when an independent variable is correlated with the error term (e.g., in a model of the number of years of education chosen, the number of children might be chosen at the same time). Two-stage least squares would first predict the number of children (based on other independent variables) and then use the prediction as the independent variable in the education model.
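
The two stages can be sketched with simulated data. Everything here is an assumption made for illustration: `x` stands in for the endogenous variable, `z` for another independent variable that predicts `x` but is unrelated to the error term, `u` for an unobserved factor that drives both `x` and `y`, and the true slope is set to 2.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for a design matrix X that includes the constant."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)                  # predictor of x, unrelated to the error term
u = rng.normal(size=n)                  # unobserved factor affecting both x and y
x = z + u + rng.normal(size=n)          # endogenous independent variable
y = 1.0 + 2.0 * x + 3.0 * u + rng.normal(size=n)

const = np.ones(n)

# Ordinary OLS of y on x: biased, because x is correlated with the error (3u + noise).
b_ols = ols(np.column_stack([const, x]), y)

# Stage 1: predict x from z.
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Stage 2: use the prediction in place of x.
b_2sls = ols(np.column_stack([const, x_hat]), y)
print(b_ols[1], b_2sls[1])
```

In this simulation the ordinary OLS slope is pulled well above 2, while the two-stage slope lands close to it.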

• Time series analysis and forecasting

If you have yearly, quarterly, or monthly data then the ordering of the observations matters (with cross-section data it doesn't matter if Sally comes before or after Jane). For example, regression models of monthly car sales might include monthly and lagged monthly advertisements. Some standard time-series models should be used to account for the correlation of the lagged advertisements.
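
Building the lagged variable is mostly a matter of shifting the series by one period, which is why the ordering of the observations matters. The sketch below uses hypothetical, noise-free monthly figures generated from made-up coefficients (50, 3 and 2); it shows only how the lag is constructed, not the time-series corrections mentioned above.

```python
import numpy as np

# Hypothetical monthly advertising, in time order.
ads = np.array([10.0, 12.0, 9.0, 15.0, 11.0, 14.0, 13.0, 16.0])

ads_now = ads[1:]    # advertising in month t
ads_lag = ads[:-1]   # advertising in month t-1 (the first month has no lag and is dropped)

# Noise-free sales built from made-up coefficients, for illustration only.
sales = 50 + 3 * ads_now + 2 * ads_lag

X = np.column_stack([np.ones(len(sales)), ads_now, ads_lag])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef)
```

Shuffling the rows of `ads` before taking the lag would pair each month with the wrong predecessor, which is not a concern with cross-section data.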

These notes are based on: Using Econometrics: A Practical Guide, by A.H. Studenmund, 3rd Edition, 1997. See also: Applied Regression: An Introduction, by Michael S. Lewis-Beck, Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-022.
