DATA 606 - Statistics & Probability - Spring 2023

Chapter 8

Introduction to Linear Regression

Learning Outcomes

  • Define the explanatory variable as the independent variable (predictor), and the response variable as the dependent variable (predicted).
  • Plot the explanatory variable ($x$) on the x-axis and the response variable ($y$) on the y-axis, and fit a linear regression model $y = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.
    • Note that the point estimates (estimated from observed data) for $\beta_0$ and $\beta_1$ are $b_0$ and $b_1$, respectively.
  • When describing the association between two numerical variables, evaluate
    • direction: positive ($x \uparrow, y \uparrow$), negative ($x \uparrow, y \downarrow$)
    • form: linear or not
    • strength: determined by the scatter around the underlying relationship
  • Define correlation as the *linear* association between two numerical variables.
    • Note that a relationship that is nonlinear is simply called an association.
  • Note that the correlation coefficient ($r$, also called Pearson's $r$) has the following properties (see the sketch after this list):
    • the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
    • the sign of the correlation coefficient indicates the direction of association
    • the correlation coefficient is always between -1 and 1, inclusive, with -1 indicating perfect negative linear association, +1 indicating perfect positive linear association, and 0 indicating no *linear* relationship
    • the correlation coefficient is unitless
    • since the correlation coefficient is unitless, it is not affected by changes in the center or scale of either variable (such as unit conversions)
    • the correlation of $x$ with $y$ is the same as the correlation of $y$ with $x$
    • the correlation coefficient is sensitive to outliers
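
These properties can be checked numerically. Below is a minimal NumPy sketch (the data vectors are hypothetical, chosen only for illustration) showing that $r$ is symmetric in its arguments and unchanged by a unit conversion of $x$:

```python
# Minimal sketch with hypothetical data: properties of Pearson's r.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_xy = np.corrcoef(x, y)[0, 1]                # correlation of x with y
r_yx = np.corrcoef(y, x)[0, 1]                # same as correlation of y with x
r_converted = np.corrcoef(2.54 * x, y)[0, 1]  # unit conversion of x leaves r unchanged

print(r_xy, r_yx, r_converted)                # all three are equal, and -1 <= r <= 1
```
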
  • Recall that correlation does not imply causation.
  • Define residual ($e$) as the difference between the observed ($y$) and predicted ($\hat{y}$) values of the response variable: $e_i = y_i - \hat{y}_i$
  • Define the least squares line as the line that minimizes the sum of the squared residuals (illustrated in the sketch after this list), and list the conditions necessary for fitting such a line:
    1. linearity
    2. nearly normal residuals
    3. constant variability
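
As a sketch of the definition (hypothetical data; `np.polyfit` stands in here for any least squares routine), the code below fits the least squares line and computes the residuals; for a least squares fit, the residuals also sum to approximately zero:

```python
# Minimal sketch with hypothetical data: least squares fit and residuals.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)  # slope b1 and intercept b0 of the least squares line
y_hat = b0 + b1 * x               # predicted values
e = y - y_hat                     # residuals: e_i = y_i - yhat_i

# The least squares line minimizes sum(e**2); its residuals sum to ~0.
print(np.sum(e**2), np.sum(e))
```
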
  • Define an indicator variable as a binary explanatory variable (with two levels).
  • Calculate the estimate for the slope ($b_1$) as $b_1 = r \frac{s_y}{s_x}$, where $r$ is the correlation coefficient, $s_y$ is the standard deviation of the response variable, and $s_x$ is the standard deviation of the explanatory variable.
  • Interpret the slope as
    • “For each unit increase in $x$, we would expect $y$ to increase/decrease on average by $|b_1|$ units” when $x$ is numerical.
    • “The average difference in the response variable between the baseline level and the other level of the explanatory variable is $|b_1|$” when $x$ is categorical (see the sketch after this list).
    • Note that whether the response variable increases or decreases is determined by the sign of $b_1$.
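
For the categorical case, a minimal sketch with a hypothetical indicator variable shows that $b_0$ recovers the baseline mean and $b_1$ the difference between the two level means:

```python
# Minimal sketch with hypothetical data: regression on an indicator (0/1) variable.
import numpy as np

x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # 0 = baseline level, 1 = other level
y = np.array([4.0, 5.0, 6.0, 9.0, 10.0, 11.0])

b1, b0 = np.polyfit(x, y, deg=1)

print(b0, y[x == 0].mean())                     # b0 = mean response at baseline (5.0)
print(b1, y[x == 1].mean() - y[x == 0].mean())  # b1 = difference in level means (5.0)
```
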
  • Note that the least squares line always passes through the point of averages of the response and explanatory variables, $(\bar{x}, \bar{y})$.
  • Use the above property to calculate the estimate for the intercept ($b_0$) as $b_0 = \bar{y} - b_1 \bar{x}$, where $b_1$ is the slope, $\bar{y}$ is the average of the response variable, and $\bar{x}$ is the average of the explanatory variable (see the sketch below).
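
Both point estimates can be computed directly from summary statistics. A minimal sketch (same hypothetical data as above) that also confirms the line passes through $(\bar{x}, \bar{y})$:

```python
# Minimal sketch with hypothetical data: b1 = r * s_y / s_x and b0 = ybar - b1 * xbar.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)  # slope from r and the sample standard deviations
b0 = y.mean() - b1 * x.mean()           # intercept from the point of averages

print(b0 + b1 * x.mean(), y.mean())     # the line passes through (xbar, ybar)
print(np.polyfit(x, y, deg=1))          # matches (b1, b0) from a direct fit
```
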
  • Interpret the intercept as
    • “When $x = 0$, we would expect $y$ to equal, on average, $b_0$” when $x$ is numerical.
    • “The expected average value of the response variable for the reference level of the explanatory variable is $b_0$” when $x$ is categorical.
  • Predict the value of the response variable for a given value of the explanatory variable, $x$, by plugging $x$ into the linear model: $\hat{y} = b_0 + b_1 x$ (see the sketch after this list).
    • Only predict for values of $x$ that are in the range of the observed data.
    • Do not extrapolate beyond the range of the data, unless you are confident that the linear pattern continues.
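
A minimal prediction sketch (hypothetical data), with a range check to guard against extrapolation:

```python
# Minimal sketch with hypothetical data: predict yhat = b0 + b1 * x for a new x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, deg=1)

x_new = 3.5  # inside the observed range [1, 5]
if x.min() <= x_new <= x.max():
    print(b0 + b1 * x_new)  # prediction from the fitted line
else:
    print("x_new is outside the observed range; predicting would be extrapolation")
```
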
  • Define $R^2$ as the percentage of the variability in the response variable explained by the explanatory variable (see the sketch after this list).
    • For a good model, we would like this number to be as close to 100% as possible.
    • This value is calculated as the square of the correlation coefficient, and is between 0 and 1, inclusive.
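
A minimal sketch (hypothetical data) confirming two equivalent computations of $R^2$ for simple linear regression:

```python
# Minimal sketch with hypothetical data: R^2 = r^2 = 1 - SSE/SST.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b0 + b1 * x)              # residuals

sse = np.sum(e**2)                 # variability left unexplained
sst = np.sum((y - y.mean())**2)    # total variability in the response
print(r**2, 1 - sse / sst)         # the two values agree
```
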
  • Define a leverage point as a point that lies away from the center of the data in the horizontal direction.
  • Define an influential point as a point that influences (changes) the slope of the regression line.
    • This is usually a leverage point that is away from the trajectory of the rest of the data.
  • Do not remove outliers from an analysis without good reason.
  • Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points.
  • Determine whether an explanatory variable is a significant predictor for the response variable using the $t$-test and the associated p-value in the regression output.
  • Set the null hypothesis testing for the significance of the predictor as $H_0: \beta_1 = 0$, and recognize that the standard software output yields the p-value for the two-sided alternative hypothesis.
    • Note that $\beta_1 = 0$ means the regression line is horizontal, hence suggesting that there is no relationship between the explanatory and the response variables.
  • Calculate the T score for the hypothesis test as $T_{df} = \frac{b_1 - \text{null value}}{SE_{b_1}}$ with $df = n - 2$ (see the sketch after this item).
    • Note that the T score has $n - 2$ degrees of freedom since we lose one degree of freedom for each parameter we estimate, and in this case we estimate the intercept and the slope.
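
A minimal sketch (hypothetical data) of this test by hand, using SciPy only for the t distribution; $SE_{b_1}$ is computed from its textbook formula:

```python
# Minimal sketch with hypothetical data: t-test for H0: beta1 = 0, df = n - 2.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b0 + b1 * x)
# SE of the slope: sqrt(MSE) / sqrt(Sxx), with MSE = SSE / (n - 2)
se_b1 = np.sqrt(np.sum(e**2) / (n - 2)) / np.sqrt(np.sum((x - x.mean())**2))

t_score = (b1 - 0) / se_b1                        # (point estimate - null value) / SE
p_value = 2 * stats.t.sf(abs(t_score), df=n - 2)  # two-sided p-value
print(t_score, p_value)
```
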
  • Note that a hypothesis test for the intercept is often irrelevant since it's usually out of the range of the data, and hence it is usually an extrapolation.
  • Calculate a confidence interval for the slope as $b_1 \pm t^{\star}_{df} \times SE_{b_1}$, where $df = n - 2$ and $t^{\star}_{df}$ is the critical score associated with the given confidence level at the desired degrees of freedom (see the sketch below).
    • Note that the standard error of the slope estimate, $SE_{b_1}$, can be found in the regression output.
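
And a matching sketch (same hypothetical data and $SE_{b_1}$ formula as above) for a 95% confidence interval for the slope:

```python
# Minimal sketch with hypothetical data: 95% confidence interval for the slope.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(e**2) / (n - 2)) / np.sqrt(np.sum((x - x.mean())**2))

t_star = stats.t.ppf(0.975, df=n - 2)  # critical t-score for 95% confidence
print(b1 - t_star * se_b1, b1 + t_star * se_b1)
```
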

Supplemental Readings

Videos
