Multiple Linear RegressionDATA 606 - Statistics & Probability for Data AnalyticsJason Bryer, Ph.D. and Angela Lui, Ph.D.May 3, 20231 / 43

One Minute Paper Results

What was the most important thing you learned during this class?

What important question remains unanswered for you?

2 / 43

Weight of Books

allbacks <- read.csv('../course_data/allbacks.csv')
head(allbacks)

##   X volume area weight cover
## 1 1    885  382    800    hb
## 2 2   1016  468    950    hb
## 3 3   1125  387   1050    hb
## 4 4    239  371    350    hb
## 5 5    701  371    750    hb
## 6 6    641  367    600    hb

From: Maindonald, J.H. & Braun, W.J. (2007). Data Analysis and Graphics Using R, 2nd ed.

3 / 43

Weights of Books (cont)

lm.out <- lm(weight ~ volume, data=allbacks)

$\hat{weight} = 108 + 0.71 volume$ $R^2 = 80\%$

4 / 43

Modeling weights of books using volumesummary(lm.out)

## 
## Call:
## lm(formula = weight ~ volume, data = allbacks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026,    Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06
5 / 43

Weights of hardcover and paperback books

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

6 / 43

Weights of hardcover and paperback books

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

Paperbacks generally weigh less than hardcover books after controlling for book's volume.

6 / 43

Modeling using volume and cover typelm.out2 <- lm(weight ~ volume + cover, data=allbacks)
summary(lm.out2)

## 
## Call:
## lm(formula = weight ~ volume + cover, data = allbacks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -32.32  -16.10   28.93  210.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.96284   59.19274   3.344 0.005841 ** 
## volume         0.71795    0.06153  11.669  6.6e-08 ***
## coverpb     -184.04727   40.49420  -4.545 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.9275,    Adjusted R-squared:  0.9154 
## F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07
7 / 43

Linear Model

$\hat{weight} = 198 + 0.72 volume - 184 coverpb$

For hardcover books: plug in 0 for cover.

$\hat{weight} = 197.96 + 0.72 volume - 184.05 \times 0 = 197.96 + 0.72 volume$

For paperback books: put in 1 for cover. $\hat{weight} = 197.96 + 0.72 volume - 184.05 \times 1$

8 / 43

Visualizing the linear model

9 / 43

Interpretation of the regression coefficients



     Estimate   Std. Error   t value   Pr(>|t|)   
  (Intercept)   197.9628   59.1927   3.34   0.0058  
  volume   0.7180   0.0615   11.67   0.0000  
  coverpb   -184.0473   40.4942   -4.55   0.0007  

Slope of volume: All else held constant, books that are 1 more cubic centimeter in volume tend to weigh about 0.72 grams more.
Slope of cover: All else held constant, the model predicts that paperback books weigh 184 grams lower than hardcover books.
Intercept: Hardcover books with no volume are expected on average to weigh 198 grams.Obviously, the intercept does not make sense in context. It only serves to adjust the height of the line.

10 / 43

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	197.9628	59.1927	3.34	0.0058
volume	0.7180	0.0615	11.67	0.0000
coverpb	-184.0473	40.4942	-4.55	0.0007

Modeling Poverty

poverty <- read.table("../course_data/poverty.txt", h = T, sep = "\t")
names(poverty) <- c("state", "metro_res", "white", "hs_grad", "poverty", "female_house")
poverty <- poverty[,c(1,5,2,3,4,6)]
head(poverty)

##        state poverty metro_res white hs_grad female_house
## 1    Alabama    14.6      55.4  71.3    79.9         14.2
## 2     Alaska     8.3      65.6  70.8    90.6         10.8
## 3    Arizona    13.3      88.2  87.7    83.8         11.1
## 4   Arkansas    18.0      52.5  81.0    80.9         12.1
## 5 California    12.8      94.4  77.5    81.1         12.6
## 6   Colorado     9.4      84.5  90.2    88.7          9.6

From: Gelman, H. (2007). Data Analysis using Regression and Multilevel/Hierarchial Models. Cambridge University Press.

11 / 43

Modeling Poverty

12 / 43

Predicting Poverty using Percent Female Householderlm.poverty <- lm(poverty ~ female_house, data=poverty)
summary(lm.poverty)

## 
## Call:
## lm(formula = poverty ~ female_house, data = poverty)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7537 -1.8252 -0.0375  1.5565  6.3285 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.3094     1.8970   1.745   0.0873 .  
## female_house   0.6911     0.1599   4.322 7.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.664 on 49 degrees of freedom
## Multiple R-squared:  0.276,    Adjusted R-squared:  0.2613 
## F-statistic: 18.68 on 1 and 49 DF,  p-value: 7.534e-05
13 / 43

% Poverty by % Female Household

14 / 43

Another look at $R^2$

$R^2$ can be calculated in three ways:

square the correlation coefficient of x and y (how we have been calculating it)
square the correlation coefficient of y and $\hat{y}$
based on definition:
$R^2 = \frac{explained \quad variability \quad in \quad y}{total \quad variability \quad in \quad y}$

Using ANOVA we can calculate the explained variability and total variability in y.

15 / 43

Sum of Squares

anova.poverty <- anova(lm.poverty)
print(xtable::xtable(anova.poverty, digits = 2), type='html')

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
female_house	1.00	132.57	132.57	18.68	0.00
Residuals	49.00	347.68	7.10

16 / 43

Sum of Squares

anova.poverty <- anova(lm.poverty)
print(xtable::xtable(anova.poverty, digits = 2), type='html')

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
female_house	1.00	132.57	132.57	18.68	0.00
Residuals	49.00	347.68	7.10

Sum of squares of y: ${ SS }_{ Total }=\sum { { \left( y-\bar { y } \right) }^{ 2 } } =480.25$ → total variability

Sum of squares of residuals: ${ SS }_{ Error }=\sum { { e }_{ i }^{ 2 } } =347.68$ → unexplained variability

Sum of squares of x: ${ SS }_{ Model }={ SS }_{ Total }-{ SS }_{ Error } = 132.57$ → explained variability

$R^2 = \frac{explained \quad variability \quad in \quad y}{total \quad variability \quad in \quad y} = \frac{132.57}{480.25} = 0.28$

16 / 43

Why bother?

For single-predictor linear regression, having three ways to calculate the same value may seem like overkill.
However, in multiple linear regression, we can't calculate $R^2$ as the square of the correlation between x and y because we have multiple xs.
And next we'll learn another measure of explained variability, adjusted $R^2$ , that requires the use of the third approach, ratio of explained and unexplained variability.

17 / 43

Predicting poverty using % female household & % white

lm.poverty2 <- lm(poverty ~ female_house + white, data=poverty)
print(xtable::xtable(lm.poverty2), type='html')

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-2.5789	5.7849	-0.45	0.6577
female_house	0.8869	0.2419	3.67	0.0006
white	0.0442	0.0410	1.08	0.2868

anova.poverty2 <- anova(lm.poverty2)
print(xtable::xtable(anova.poverty2, digits = 3), type='html')

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
female_house	1.000	132.568	132.568	18.745	0.000
white	1.000	8.207	8.207	1.160	0.287
Residuals	48.000	339.472	7.072

$R^2 = \frac{explained \quad variability \quad in \quad y}{total \quad variability \quad in \quad y} = \frac{132.57 + 8.21}{480.25} = 0.29$

18 / 43

Unique information

Does adding the variable white to the model add valuable information that wasn't provided by female_house?

19 / 43

Collinearity between explanatory variables

poverty vs % female head of household

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	3.3094	1.8970	1.74	0.0873
female_house	0.6911	0.1599	4.32	0.0001

poverty vs % female head of household and % female household

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-2.5789	5.7849	-0.45	0.6577
female_house	0.8869	0.2419	3.67	0.0006
white	0.0442	0.0410	1.08	0.2868

Note the difference in the estimate for female_house.

20 / 43

Collinearity between explanatory variables

Two predictor variables are said to be collinear when they are correlated, and this collinearity complicates model estimation.
Remember: Predictors are also called explanatory or independent variables. Ideally, they would be independent of each other.
We don't like adding predictors that are associated with each other to the model, because often times the addition of such variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. parsimonious model.
While it's impossible to avoid collinearity from arising in observational data, experiments are usually designed to prevent correlation among predictors

21 / 43

R2 vs. adjusted R2

Model
R2
Adjusted R2


Model 1 (Single-predictor)
0.28
0.26

Model 2 (Multiple)
0.29
0.26

When any variable is added to the model R2 increases.
But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R2 does not increase.
22 / 43

Model	$R^2$	Adjusted $R^2$
Model 1 (Single-predictor)	0.28	0.26
Model 2 (Multiple)	0.29	0.26

Adjusted $R^2$

${ R }_{ adj }^{ 2 }={ 1-\left( \frac { { SS }_{ error } }{ { SS }_{ total } } \times \frac { n-1 }{ n-p-1 } \right) }$

where n is the number of cases and p is the number of predictors (explanatory variables) in the model.

Because p is never negative, ${ R }_{ adj }^{ 2 }$ will always be smaller than $R^2$ .
${ R }_{ adj }^{ 2 }$ applies a penalty for the number of predictors included in the model.
Therefore, we choose models with higher ${ R }_{ adj }^{ 2 }$ over others.

23 / 43

Predictive Modeling24 / 43

Example: Hours Studying Predicting Passing

study <- data.frame(
    Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,
            3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50),
    Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1)
)
study[sample(nrow(study), 5),]

##    Hours Pass
## 4   1.25    0
## 2   0.75    0
## 11  2.75    1
## 9   2.25    1
## 1   0.50    0

tab <- describeBy(study$Hours, group = study$Pass, mat = TRUE, skew = FALSE)
tab$group1 <- as.integer(as.character(tab$group1))

25 / 43

Prediction

Odds (or probability) of passing if studied zero hours?

$log(\frac{p}{1-p}) = -4.078 + 1.505 \times 0$ $\frac{p}{1-p} = exp(-4.078) = 0.0169$ $p = \frac{0.0169}{1.169} = .016$

26 / 43

Prediction

Odds (or probability) of passing if studied zero hours?

$log(\frac{p}{1-p}) = -4.078 + 1.505 \times 0$ $\frac{p}{1-p} = exp(-4.078) = 0.0169$ $p = \frac{0.0169}{1.169} = .016$

Odds (or probability) of passing if studied 4 hours?

$log(\frac{p}{1-p}) = -4.078 + 1.505 \times 4$ $\frac{p}{1-p} = exp(1.942) = 6.97$ $p = \frac{6.97}{7.97} = 0.875$

26 / 43

Fitted Values

study[1,]

##   Hours Pass
## 1   0.5    0

logistic <- function(x, b0, b1) {
    return(1 / (1 + exp(-1 * (b0 + b1 * x)) ))
}
logistic(.5, b0=-4.078, b1=1.505)

## [1] 0.03470667

27 / 43

Model Performance

The use of statistical models to predict outcomes, typically on new data, is called predictive modeling. Logistic regression is a common statistical procedure used for prediction. We will utilize a confusion matrix to evaluate accuracy of the predictions.

28 / 43

Predicting Heart Attacks

Source: https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction?select=data.csv

heart <- read.csv('../course_data/heart_attack_predictions.csv')
heart <- heart |>
    mutate_if(is.character, as.numeric) |>
    select(!c(slope, ca, thal))
str(heart)

## 'data.frame':    294 obs. of  11 variables:
##  $ age     : int  28 29 29 30 31 32 32 32 33 34 ...
##  $ sex     : int  1 1 1 0 0 0 1 1 1 0 ...
##  $ cp      : int  2 2 2 1 2 2 2 2 3 2 ...
##  $ trestbps: num  130 120 140 170 100 105 110 125 120 130 ...
##  $ chol    : num  132 243 NA 237 219 198 225 254 298 161 ...
##  $ fbs     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ restecg : num  2 0 0 1 1 0 0 0 0 0 ...
##  $ thalach : num  185 160 170 170 150 165 184 155 185 190 ...
##  $ exang   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ oldpeak : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ num     : int  0 0 0 0 0 0 0 0 0 0 ...

Note: num is the diagnosis of heart disease (angiographic disease status) (i.e. Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing)

29 / 43

Missing Data

We will save this for another day...

complete.cases(heart) |> table()

## 
## FALSE  TRUE 
##    33   261

mice_out <- mice::mice(heart, m = 1)

## 
##  iter imp variable
##   1   1  trestbps  chol  fbs  restecg  thalach  exang
##   2   1  trestbps  chol  fbs  restecg  thalach  exang
##   3   1  trestbps  chol  fbs  restecg  thalach  exang
##   4   1  trestbps  chol  fbs  restecg  thalach  exang
##   5   1  trestbps  chol  fbs  restecg  thalach  exang

heart <- mice::complete(mice_out)

30 / 43

Data Setup

We will split the data into a training set (70% of observations) and validation set (30%).

train.rows <- sample(nrow(heart), nrow(heart) * .7)
heart_train <- heart[train.rows,]
heart_test <- heart[-train.rows,]

This is the proportions of survivors and defines what our "guessing" rate is. That is, if we guessed no one had a heart attack, we would be correct 62% of the time.

(heart_attack <- table(heart_train$num) %>% prop.table)

## 
##         0         1 
## 0.6146341 0.3853659

31 / 43

Model Training

lr.out <- glm(num ~ ., data=heart_train, family=binomial(link = 'logit'))
summary(lr.out)

## 
## Call:
## glm(formula = num ~ ., family = binomial(link = "logit"), data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5444  -0.4916  -0.2460   0.4070   2.4499  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.317639   3.267008  -1.628 0.103593    
## age          0.009236   0.032561   0.284 0.776686    
## sex          1.326149   0.532826   2.489 0.012814 *  
## cp           0.979203   0.265746   3.685 0.000229 ***
## trestbps    -0.009196   0.012694  -0.724 0.468809    
## chol         0.007359   0.003416   2.154 0.031215 *  
## fbs          2.230179   0.835767   2.668 0.007621 ** 
## restecg     -1.302957   0.545488  -2.389 0.016912 *  
## thalach     -0.009773   0.010928  -0.894 0.371148    
## exang        1.085218   0.552319   1.965 0.049433 *  
## oldpeak      1.060572   0.303967   3.489 0.000485 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273.32  on 204  degrees of freedom
## Residual deviance: 145.44  on 194  degrees of freedom
## AIC: 167.44
## 
## Number of Fisher Scoring iterations: 6

32 / 43

Predicted Values

heart_train$prediction <- predict(lr.out, type = 'response', newdata = heart_train)
ggplot(heart_train, aes(x = prediction, color = num == 1)) + geom_density()

33 / 43

Results

heart_train$prediction_class <- heart_train$prediction > 0.5
tab <- table(heart_train$prediction_class, 
             heart_train$num) %>% prop.table() %>% print()

##        
##                  0          1
##   FALSE 0.55121951 0.10243902
##   TRUE  0.06341463 0.28292683

For the training set, the overall accuracy is 83.41%. Recall that 61.46% people did not have a heart attach. Therefore, the simplest model would be to predict that no one had a heart attack, which would mean we would be correct 61.46% of the time. Therefore, our prediction model is 21.95% better than guessing.

34 / 43

Checking with the validation dataset

(survived_test <- table(heart_test$num) %>% prop.table())

## 
##         0         1 
## 0.6966292 0.3033708

heart_test$prediction <- predict(lr.out, newdata = heart_test, type = 'response')
heart_test$prediciton_class <- heart_test$prediction > 0.5
tab_test <- table(heart_test$prediciton_class, heart_test$num) %>%
    prop.table() %>% print()

##        
##                  0          1
##   FALSE 0.64044944 0.07865169
##   TRUE  0.05617978 0.22471910

The overall accuracy is 86.52%, or 16.9% better than guessing.

35 / 43

Receiver Operating Characteristic (ROC) Curve

The ROC curve is created by plotting the true positive rate (TPR; AKA sensitivity) against the false positive rate (FPR; AKA probability of false alarm) at various threshold settings.

roc <- calculate_roc(heart_train$prediction, heart_train$num == 1)
summary(roc)

## AUC = 0.915
## Cost of false-positive = 1
## Cost of false-negative = 1
## Threshold with minimum cost = 0.556

36 / 43

ROC Curve

plot(roc)

37 / 43

ROC Curve

plot(roc, curve = 'accuracy')

38 / 43

Caution on Interpreting Accuracy

Loh, Sooo, and Zing (2016) predicted sexual orientation based on Facebook Status.
They reported model accuracies of approximately 90% using SVM, logistic regression and/or random forest methods.

39 / 43

Caution on Interpreting Accuracy

Loh, Sooo, and Zing (2016) predicted sexual orientation based on Facebook Status.
They reported model accuracies of approximately 90% using SVM, logistic regression and/or random forest methods.
Gallup (2018) poll estimates that 4.5% of the Americal population identifies as LGBT.

39 / 43

Caution on Interpreting Accuracy

Loh, Sooo, and Zing (2016) predicted sexual orientation based on Facebook Status.
They reported model accuracies of approximately 90% using SVM, logistic regression and/or random forest methods.
Gallup (2018) poll estimates that 4.5% of the Americal population identifies as LGBT.
My proposed model: I predict all Americans are heterosexual.
The accuracy of my model is 95.5%, or 5.5% better than Facebook's model!
Predicting "rare" events (i.e. when the proportion of one of the two outcomes large) is difficult and requires independent (predictor) variables that strongly associated with the dependent (outcome) variable.

39 / 43

Fitted Values Revisited

What happens when the ratio of true-to-false increases (i.e. want to predict "rare" events)?

Let's simulate a dataset where the ratio of true-to-false is 10-to-1. We can also define the distribution of the dependent variable. Here, there is moderate separation in the distributions.

test.df2 <- getSimulatedData(
    treat.mean=.6, control.mean=.4)

The multilevelPSA::psrange function will sample with varying ratios from 1:10 to 1:1. It takes multiple samples and averages the ranges and distributions of the fitted values from logistic regression.

psranges2 <- psrange(test.df2, test.df2$treat, treat ~ .,
                     samples=seq(100,1000,by=100), nboot=20)

40 / 43

Fitted Values Revisited (cont.)

plot(psranges2)

41 / 43

Additional Resources

42 / 43

One Minute Paper

Complete the one minute paper: https://forms.gle/p9xcKcTbGiyYSz368

What was the most important thing you learned during this class?
What important question remains unanswered for you?

43 / 43

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Multiple Linear Regression

DATA 606 - Statistics & Probability for Data Analytics

Jason Bryer, Ph.D. and Angela Lui, Ph.D.

May 3, 2023

One Minute Paper Results

Weight of Books

Weights of Books (cont)

Modeling weights of books using volume

Weights of hardcover and paperback books

Weights of hardcover and paperback books

Modeling using volume and cover type

Linear Model

Visualizing the linear model

Interpretation of the regression coefficients

Modeling Poverty

Modeling Poverty

Predicting Poverty using Percent Female Householder

% Poverty by % Female Household

Another look at R2R^2

Sum of Squares

Sum of Squares

Why bother?

Predicting poverty using % female household & % white

Unique information

Collinearity between explanatory variables

Collinearity between explanatory variables

R2R^2 vs. adjusted R2R^2

Adjusted R2R^2

Predictive Modeling

Example: Hours Studying Predicting Passing

Prediction

Prediction

Fitted Values

Model Performance

Predicting Heart Attacks

Missing Data

Data Setup

Model Training

Predicted Values

Results

Checking with the validation dataset

Receiver Operating Characteristic (ROC) Curve

ROC Curve

ROC Curve

Caution on Interpreting Accuracy

Caution on Interpreting Accuracy

Caution on Interpreting Accuracy

Fitted Values Revisited

Fitted Values Revisited (cont.)

Additional Resources

One Minute Paper

One Minute Paper Results

Help

Another look at $R^2$

$R^2$ vs. adjusted $R^2$

Adjusted $R^2$