
Foundation for Inference

DATA 606 - Statistics & Probability for Data Analytics

Jason Bryer, Ph.D. and Angela Lui, Ph.D.

March 8, 2023

1 / 63

Midterm

  • Available on March 15th.

  • Due March 19th (by midnight)

  • Covers chapters 1 through 5.

  • 20 multiple choice questions.

  • You may use your notes, textbook, and course site. Do not consult with anyone else.

2 / 63

One Minute Paper Results

What was the most important thing you learned during this class?

What important question remains unanswered for you?

3 / 63

Crash Course in Calculus

4 / 63

Crash Course in Calculus

There are three major concepts in calculus that will be helpful to understand:

Limits - the value that a function (or sequence) approaches as the input (or index) approaches some value.

Derivatives - the slope of the line tangent at any given point on a function.

Integrals - the area under the curve.
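As a quick illustration, here is a minimal base R sketch of all three ideas (the function g and step size h are hypothetical choices for this example): the derivative is approximated with a finite difference as the step approaches zero (a limit), and integrate() computes the area under the curve.

g <- function(x) x^2

# Limit/derivative: the slope of the tangent at x = 1, approximated by a
# finite difference with a small step h (the limit as h approaches 0).
h <- 1e-6
(g(1 + h) - g(1)) / h # approximately 2, the derivative of x^2 at x = 1

# Integral: the area under g between 0 and 2 (exactly 8/3).
integrate(g, 0, 2)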

5 / 63

Derivatives

6 / 63

Derivatives

7 / 63

Derivatives

8 / 63

Derivatives

9 / 63

Derivatives

10 / 63

Derivatives

11 / 63

Derivatives

12 / 63

Derivatives

13 / 63

Derivatives

14 / 63

Function for Normal Distribution

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

library(ggplot2)
f <- function(x, mean = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mean) / sigma)^2)
}
min <- 0; max <- 2
ggplot() + stat_function(fun = f) + xlim(c(-4, 4)) +
  geom_vline(xintercept = c(min, max), color = 'blue', linetype = 2) + xlab('x')

15 / 63

Riemann Sums

One strategy to find the area between two values is to draw a series of rectangles. Given n rectangles, we know that the width of each is $\frac{2 - 0}{n}$ and the height is $f(x)$. Here is an example with 3 rectangles.
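As a sketch, we can compute this Riemann sum directly in R using the normal density f defined earlier (riemann is a hypothetical helper; heights are taken at the left edge of each rectangle, which is an assumption since the slide does not specify):

riemann <- function(f, a, b, n) {
  width <- (b - a) / n # width of each rectangle
  x <- a + (seq_len(n) - 1) * width # left edge of each rectangle
  sum(width * f(x)) # total area of the rectangles
}
riemann(f, 0, 2, 3) # 3 rectangles: a rough approximation
riemann(f, 0, 2, 300) # 300 rectangles: close to 0.477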

16 / 63

Riemann Sums (10 rectangles)

17 / 63

Riemann Sums (30 rectangles)

18 / 63

Riemann Sums (300 rectangles)

19 / 63

n → ∞

As n approaches infinity we are going to get the exact value for the area under the curve. This notion of letting a value get increasingly close to infinity, zero, or any other value, is called the limit.

The area under a function is called the integral.

integrate(f, 0, 2)
## 0.4772499 with absolute error < 5.3e-15
DATA606::shiny_demo('calculus')
20 / 63

Normal Distribution

normal_plot(cv = c(0, 2))

pnorm(2) - pnorm(0)
## [1] 0.4772499
21 / 63

R's built-in functions for working with distributions

See https://github.com/jbryer/DATA606Fall2021/blob/master/R/distributions.R

22 / 63

Foundation for Inference

23 / 63

Population Distribution (Uniform)

n <- 1e5
pop <- runif(n, 0, 1)
mean(pop)
## [1] 0.5010507

24 / 63

Random Sample (n=10)

samp1 <- sample(pop, size=10)
mean(samp1)
## [1] 0.5462003
hist(samp1)

25 / 63

Random Sample (n=30)

samp2 <- sample(pop, size=30)
mean(samp2)
## [1] 0.5380609
hist(samp2)

26 / 63

Lots of Random Samples

M <- 1000
samples <- numeric(length = M)
for(i in seq_len(M)) {
  samples[i] <- mean(sample(pop, size = 30))
}
head(samples, n=8)
## [1] 0.4481185 0.5153039 0.5844942 0.4960197 0.4552045 0.3678853 0.4533659
## [8] 0.5247244
27 / 63

Sampling Distribution

hist(samples)

28 / 63

Central Limit Theorem (CLT)

Let X1, X2, ..., Xn be independent, identically distributed random variables with mean μ and variance σ2, both finite. Then for any constant z,

$$\lim_{n \rightarrow \infty} P\left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z \right) = \Phi(z)$$

where Φ is the cumulative distribution function (cdf) of the standard normal distribution.

29 / 63

In other words...

The distribution of the sample mean is well approximated by a normal model:

$$\bar{x} \sim N\left( \text{mean} = \mu,\ SE = \frac{\sigma}{\sqrt{n}} \right)$$

where SE represents the standard error, which is defined as the standard deviation of the sampling distribution. In most cases σ is not known, so use s.
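We can check this against the simulation from the previous slides. A quick sketch, assuming pop and samples are still in the workspace: the standard deviation of the 1,000 simulated sample means should be close to the CLT's σ/√n, which is about 0.053 for a uniform(0, 1) population with n = 30.

sd(samples) # empirical standard error of the sampling distribution
sd(pop) / sqrt(30) # CLT prediction: sigma / sqrt(n)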

30 / 63

CLT Shiny App

library(DATA606)
shiny_demo('sampdist')
shiny_demo('CLT_mean')
31 / 63

Standard Error

samp2 <- sample(pop, size=30)
mean(samp2)
## [1] 0.5950164
(samp2.se <- sd(samp2) / sqrt(length(samp2)))
## [1] 0.05579335
32 / 63

Confidence Interval

The confidence interval is then $\bar{x} \pm CV \times SE$ where CV is the critical value. For a 95% confidence interval, the critical value is ~1.96 since

$$\int_{-1.96}^{1.96} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \, dx \approx 0.95$$

qnorm(0.025) # Remember we need to consider the two tails, 2.5% to the left, 2.5% to the right.
## [1] -1.959964
(samp2.ci <- c(mean(samp2) - 1.96 * samp2.se, mean(samp2) + 1.96 * samp2.se))
## [1] 0.4856615 0.7043714
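The same recipe works for any confidence level; a small sketch of computing the critical value in general (conf is a hypothetical variable name):

conf <- 0.90 # desired confidence level
(cv <- qnorm((1 - conf) / 2, lower.tail = FALSE))
# cv is ~1.645 for 90%, ~1.96 for 95%, and ~2.576 for 99%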
33 / 63

Confidence Intervals (cont.)

We are 95% confident that the true population mean is between 0.4856615 and 0.7043714.

That is, if we were to take 100 random samples and construct a confidence interval from each, we would expect about 95 of those intervals to contain the true population mean.

ci <- data.frame(mean = numeric(), min = numeric(), max = numeric())
for(i in seq_len(100)) {
  samp <- sample(pop, size = 30)
  se <- sd(samp) / sqrt(length(samp))
  ci[i,] <- c(mean(samp),
              mean(samp) - 1.96 * se,
              mean(samp) + 1.96 * se)
}
ci$sample <- 1:nrow(ci)
ci$sig <- ci$min < 0.5 & ci$max > 0.5
34 / 63

Confidence Intervals

ggplot(ci, aes(x = min, xend = max, y = sample, yend = sample, color = sig)) +
  geom_vline(xintercept = 0.5) +
  geom_segment() + xlab('CI') + ylab('') +
  scale_color_manual(values = c('TRUE' = 'grey', 'FALSE' = 'red'))

35 / 63

Null Hypothesis Testing

36 / 63

Hypothesis Testing

  • We start with a null hypothesis ( H0 ) that represents the status quo.

  • We also have an alternative hypothesis ( HA ) that represents our research question, i.e. what we're testing for.

  • We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or traditional methods based on the central limit theorem.

  • If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.

37 / 63

Hypothesis Testing (using CI)

H0: The mean of samp2 = 0.5
HA: The mean of samp2 ≠ 0.5

Using confidence intervals, if the null value is within the confidence interval, then we fail to reject the null hypothesis.

(samp2.ci <- c(mean(samp2) - 1.96 * sd(samp2) / sqrt(length(samp2)),
               mean(samp2) + 1.96 * sd(samp2) / sqrt(length(samp2))))
## [1] 0.4856615 0.7043714

Since 0.5 falls within the interval (0.4856615, 0.7043714), we fail to reject the null hypothesis.

38 / 63

Hypothesis Testing (using p-values)

$$\bar{x} \sim N\left( \text{mean} = 0.49,\ SE = \frac{0.27}{\sqrt{30}} = 0.049 \right)$$

$$Z = \frac{\bar{x} - \text{null}}{SE} = \frac{0.49 - 0.50}{0.049} = -0.204$$

pnorm(-.204) * 2
## [1] 0.8383535
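The same calculation can be done directly from samp2 without rounding; a sketch using the objects defined on the earlier slides:

z <- (mean(samp2) - 0.5) / (sd(samp2) / sqrt(length(samp2)))
pnorm(-abs(z)) * 2 # two-sided p-value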
39 / 63

Hypothesis Testing (using p-values)

DATA606::normal_plot(cv = c(.204), tails = 'two.sided')

40 / 63

Type I and II Errors

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a decision about which might be true, but our choice might be incorrect.

           fail to reject H0    reject H0
H0 true    correct decision     Type I Error
HA true    Type II Error        correct decision



  • Type I Error: Rejecting the null hypothesis when it is true.
  • Type II Error: Failing to reject the null hypothesis when it is false.
41 / 63

Hypothesis Test

If we again think of a hypothesis test as a criminal trial then it makes sense to frame the verdict in terms of the null and alternative hypotheses:

H0 : Defendant is innocent
HA : Defendant is guilty

Which type of error is being committed in the following circumstances?

  • Declaring the defendant innocent when they are actually guilty

    Type II error
  • Declaring the defendant guilty when they are actually innocent

    Type I error

Which error do you think is the worse error to make?

42 / 63

Null Distribution

(cv <- qnorm(0.05, mean=0, sd=1, lower.tail=FALSE))
## [1] 1.644854

43 / 63

Alternative Distribution

pnorm(cv, mean=cv, lower.tail = FALSE)
## [1] 0.5
44 / 63

Another Example (mu = 2.5)

mu <- 2.5
(cv <- qnorm(0.05, mean = 0, sd = 1, lower.tail = FALSE))
## [1] 1.644854

45 / 63

Numeric Values

Type I Error

pnorm(mu, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.006209665

Type II Error

pnorm(cv, mean=mu, lower.tail = TRUE)
## [1] 0.1962351
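The power of the test, the probability of correctly rejecting H0 when μ = 2.5, is the complement of the Type II error rate; a quick check using the same cv and mu:

pnorm(cv, mean = mu, lower.tail = FALSE) # power = 1 - Type II error, about 0.80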
46 / 63

Shiny Application

Visualizing Type I and Type II errors: https://bcdudek.net/betaprob/

47 / 63

Statistical vs. Practical Significance

XKCD p-values

  • Real differences between the point estimate and null value are easier to detect with larger samples.

  • However, very large samples can produce statistical significance even when the difference between the sample mean and the null value (the effect size) is too small to be practically significant, as the sketch after this list illustrates.

  • This is especially important to research: if we conduct a study, we want to focus on finding meaningful results (we want observed differences to be real, but also large enough to matter).

  • The role of a statistician is not just in the analysis of data, but also in planning and design of a study.
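A minimal sketch of the second point, using a hypothetical seed and a simulated true difference of only 0.01: with a million observations even this trivial effect is highly statistically significant.

set.seed(2112) # hypothetical seed for reproducibility
x <- rnorm(1e6, mean = 0.01, sd = 1) # true effect size of only 0.01
t.test(x, mu = 0)$p.value # tiny p-value despite a practically meaningless difference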

49 / 63

Bootstrapping

50 / 63

Bootstrapping

51 / 63

Bootstrapping Example (Population)

Define our population with a uniform distribution.

n <- 1e5
pop <- runif(n, 0, 1)
mean(pop)
## [1] 0.4990189

52 / 63

Bootstrapping Example (Sample)

We observe one random sample from the population.

samp1 <- sample(pop, size = 50)

53 / 63

Bootstrapping Example (Estimate)

boot.samples <- numeric(1000) # 1,000 bootstrap samples
for(i in seq_along(boot.samples)) {
  tmp <- sample(samp1, size = length(samp1), replace = TRUE)
  boot.samples[i] <- mean(tmp)
}
head(boot.samples)
## [1] 0.5135789 0.5028579 0.5116830 0.4688500 0.5333870 0.5043232
54 / 63

Bootstrapping Example (Distribution)

library(openintro) # assumed source of the COL color palette used below
d <- density(boot.samples)
h <- hist(boot.samples, plot = FALSE)
hist(boot.samples, main = 'Bootstrap Distribution', xlab = "", freq = FALSE,
     ylim = c(0, max(d$y, h$density) + .5), col = COL[1,2], border = "white",
     cex.main = 1.5, cex.axis = 1.5, cex.lab = 1.5)
lines(d, lwd = 3)

55 / 63

95% confidence interval

c(mean(boot.samples) - 1.96 * sd(boot.samples),
  mean(boot.samples) + 1.96 * sd(boot.samples))
## [1] 0.4077162 0.5704055
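An alternative that does not assume the bootstrap distribution is normal is the percentile interval, which takes the middle 95% of the bootstrap distribution directly:

quantile(boot.samples, probs = c(0.025, 0.975))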
56 / 63

Bootstrapping is not just for means!

boot.samples.median <- numeric(1000) # 1,000 bootstrap samples
for(i in seq_along(boot.samples.median)) {
  tmp <- sample(samp1, size = length(samp1), replace = TRUE)
  boot.samples.median[i] <- median(tmp) # NOTICE WE ARE NOW USING THE median FUNCTION!
}
head(boot.samples.median)
## [1] 0.4904070 0.2271723 0.3932846 0.3834050 0.5891012 0.4172554

95% confidence interval for the median

c(mean(boot.samples.median) - 1.96 * sd(boot.samples.median),
  mean(boot.samples.median) + 1.96 * sd(boot.samples.median))
## [1] 0.3021439 0.6447989
57 / 63

Review

58 / 63

Review: Sampling Distribution

59 / 63

Review: Sampling Distribution

60 / 63

Review: Sampling Distribution

61 / 63

Review: Add Bootstrap Distribution

62 / 63

One Minute Paper

Complete the one minute paper: https://forms.gle/p9xcKcTbGiyYSz368

  1. What was the most important thing you learned during this class?
  2. What important question remains unanswered for you?
63 / 63
