Chapter 8
Introduction to Linear Regression
Learning Outcomes
- Define the explanatory variable as the independent variable (predictor), and the response variable as the dependent variable (predicted).
- Plot the explanatory variable ($x$) on the x-axis and the response variable ($y$) on the y-axis, and fit a linear regression model
$$ y = \beta_0 + \beta_1 x $$
where $\beta_0$ is the intercept and $\beta_1$ is the slope.
- Note that the point estimates (estimated from observed data) for $\beta_0$ and $\beta_1$ are $b_0$ and $b_1$, respectively.
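As a minimal sketch of obtaining these point estimates in Python (NumPy assumed; the data below are made up for illustration), `np.polyfit` with degree 1 returns the least squares coefficients:

```python
import numpy as np

# Hypothetical data: x is the explanatory variable, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit returns the coefficients highest power first: (b1, b0).
b1, b0 = np.polyfit(x, y, deg=1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```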
- When describing the association between two numerical variables, evaluate
- direction: positive ($x\uparrow, y\uparrow$), negative ($x\uparrow, y\downarrow$)
- form: linear or not
- strength: determined by the scatter around the underlying relationship
- Define correlation as the \emph{linear} association between two numerical variables.
- Note that a relationship that is nonlinear is simply called an association.
- Note that the correlation coefficient ($r$, also called Pearson's $r$) has the following properties:
- the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
- the sign of the correlation coefficient indicates the direction of association
- the correlation coefficient is always between -1 and 1, inclusive, with -1 indicating perfect negative linear association, +1 indicating perfect positive linear association, and 0 indicating no \emph{linear} relationship
- the correlation coefficient is unitless
- since the correlation coefficient is unitless, it is not affected by changes in the center or scale of either variable (such as unit conversions)
- the correlation of $x$ with $y$ is the same as the correlation of $y$ with $x$
- the correlation coefficient is sensitive to outliers
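Several of these properties can be verified numerically. A sketch, again with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(np.corrcoef(x, y)[0, 1])         # r, always between -1 and 1
print(np.corrcoef(y, x)[0, 1])         # symmetry: same as corr(x, y)
print(np.corrcoef(2.54 * x, y)[0, 1])  # unchanged by a unit conversion of x

# Sensitivity to outliers: one extreme point can change r substantially.
x_out = np.append(x, 20.0)
y_out = np.append(y, 0.0)
print(np.corrcoef(x_out, y_out)[0, 1])
```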
- Recall that correlation does not imply causation.
- Define residual ($e$) as the difference between the observed ($y$) and predicted ($\hat{y}$) values of the response variable:
$$ e_i = y_i - \hat{y}_i $$
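A short sketch computing residuals from a fitted line (hypothetical data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x   # predicted values
e = y - y_hat         # residuals: observed minus predicted
print(e)
print(e.sum())        # least squares residuals sum to (numerically) zero
```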
- Define the least squares line as the line that minimizes the sum of the squared residuals, and list the conditions necessary for fitting such a line (a diagnostic sketch follows this list):
- linearity
- nearly normal residuals
- constant variability
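These conditions are typically checked graphically with residual diagnostics. One possible sketch, assuming matplotlib and SciPy are available (hypothetical data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
e = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(fitted, e)          # look for curvature (linearity) and
ax1.axhline(0, color="gray")    # fanning (non-constant variability)
ax1.set_xlabel("fitted values")
ax1.set_ylabel("residuals")
stats.probplot(e, plot=ax2)     # normal probability plot of residuals
plt.show()
```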
- Define an indicator variable as a binary explanatory variable (with two levels).
- Calculate the estimate for the slope ($b_1$) as
$$ b_1 = r \frac{s_y}{s_x} $$
where $r$ is the correlation coefficient, $s_y$ is the standard deviation of the response variable, and $s_x$ is the standard deviation of the explanatory variable (see the numerical check after this list).
- Interpret the slope as
- “For each unit increase in $x$, we would expect $y$ to increase/decrease on average by $|b_1|$ units” when $x$ is numerical.
- “The average increase/decrease in the response variable between the baseline level and the other level of the explanatory variable is $|b_1|$” when $x$ is categorical.
- Note that whether the response variable increases or decreases is determined by the sign of $b_1$.
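A quick numerical check of the slope formula against a direct least squares fit (hypothetical data; `ddof=1` gives the sample standard deviation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)  # sample standard deviation of x
s_y = np.std(y, ddof=1)  # sample standard deviation of y

b1 = r * s_y / s_x
print(b1)                          # matches the slope from the direct fit
print(np.polyfit(x, y, deg=1)[0])
```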
- Note that the least squares line always passes through the averages of the response and explanatory variables, $(\bar{x}, \bar{y})$.
- Use the above property to calculate the estimate for the intercept ($b_0$) as
$$ b_0 = \bar{y} - b_1 \bar{x} $$
where $b_1$ is the slope, $\bar{y}$ is the average of the response variable, and $\bar{x}$ is the average of the explanatory variable.
- Interpret the intercept as
- “When $x = 0$, we would expect $y$ to equal, on average, $b_0$” when $x$ is numerical.
- “The expected average value of the response variable for the reference level of the explanatory variable is $b_0$” when $x$ is categorical.
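A sketch of the intercept calculation using the point-of-averages property (same hypothetical data as above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.corrcoef(x, y)[0, 1] * np.std(y, ddof=1) / np.std(x, ddof=1)

# The line passes through (x-bar, y-bar), so the intercept follows directly.
b0 = np.mean(y) - b1 * np.mean(x)
print(b0)
```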
- Predict the value of the response variable for a given value of the explanatory variable, $x^\star$, by plugging $x^\star$ into the linear model:
$$ \hat{y} = b_0 + b_1 x^\star $$
- Only predict for values of $x^\star$ that are in the range of the observed data.
- Do not extrapolate beyond the range of the data, unless you are confident that the linear pattern continues.
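A sketch of prediction with a simple range check; the `predict` helper below is a hypothetical convenience for illustration, not part of any library:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, deg=1)

def predict(x_star):
    """Return y-hat = b0 + b1 * x_star, flagging extrapolation."""
    if not (x.min() <= x_star <= x.max()):
        print("warning: x_star lies outside the observed range of x")
    return b0 + b1 * x_star

print(predict(3.5))  # within the range of the data
```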
- Define $R^2$ as the percentage of the variability in the response variable explained by the explanatory variable.
- For a good model, we would like this number to be as close to 100% as possible.
- This value is calculated as the square of the correlation coefficient, and is between 0 and 1, inclusive.
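Both routes to $R^2$ can be checked numerically (hypothetical data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)  # R^2 as the square of the correlation coefficient

# Equivalent: proportion of variability in y explained by the model.
b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b0 + b1 * x)
print(1 - np.sum(e ** 2) / np.sum((y - np.mean(y)) ** 2))
```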
- Define a leverage point as a point that lies away from the center of the data in the horizontal direction.
- Define an influential point as a point that influences (changes) the slope of the regression line.
- This is usually a leverage point that is away from the trajectory of the rest of the data.
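A small demonstration of influence (hypothetical data): adding one high-leverage point far from the trend of the rest of the data changes the fitted slope substantially.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(np.polyfit(x, y, deg=1)[0])  # slope with the original data

# Add one high-leverage point away from the trajectory of the rest:
x_infl = np.append(x, 15.0)
y_infl = np.append(y, 2.0)
print(np.polyfit(x_infl, y_infl, deg=1)[0])  # slope changes markedly
```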
- Do not remove outliers from an analysis without good reason.
- Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points.
- Determine whether an explanatory variable is a significant predictor for the response variable using the $t$-test and the associated p-value in the regression output.
- Set the null hypothesis for testing the significance of the predictor as
$$ H_0: \beta_1 = 0 $$
and recognize that standard software output yields the p-value for the two-sided alternative hypothesis.
- Note that $\beta_1 = 0$ means the regression line is horizontal, suggesting that there is no relationship between the explanatory and response variables.
- Calculate the T score for the hypothesis test as
$$ T_{df} = \frac{b_1 - \text{null value}}{SE_{b_1}} $$
with $df = n - 2$.
- Note that the T score has $n - 2$ degrees of freedom, since we lose one degree of freedom for each parameter we estimate; in this case we estimate the intercept and the slope.
- Note that a hypothesis test for the intercept is often irrelevant, since $x = 0$ is usually outside the range of the data, and hence the intercept is an extrapolation.
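A sketch of this test using SciPy's `linregress` (hypothetical data); the manually computed T score and p-value agree with the function's output:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

res = stats.linregress(x, y)       # slope, intercept, r, p-value, SE
df = len(x) - 2
T = (res.slope - 0) / res.stderr   # T score against H0: beta1 = 0
p = 2 * stats.t.sf(abs(T), df)     # two-sided p-value

print(T, p)
print(res.pvalue)                  # agrees with the p-value above
```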
- Calculate a confidence interval for the slope as
$$ b_1 \pm t^\star_{df} SE_{b_1} $$
where $df = n - 2$ and $t^\star_{df}$ is the critical score associated with the given confidence level at the desired degrees of freedom.
- Note that the standard error of the slope estimate, $SE_{b_1}$, can be found on the regression output.
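A sketch of the confidence interval calculation (hypothetical data; 95% confidence):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

res = stats.linregress(x, y)      # res.stderr is the SE of the slope
df = len(x) - 2
t_star = stats.t.ppf(0.975, df)   # critical score for 95% confidence

print(res.slope - t_star * res.stderr,
      res.slope + t_star * res.stderr)
```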
Supplemental Readings
- Linear regression with SAT scores - This document outlines the implementation of linear regression step by step, emphasizing visualizations.