What was the most important thing you learned during this class?
What important question remains unanswered for you?
Due April 2ndish. Select a dataset that interests you. For the proposal, you need to answer the questions below.
More information, including a template and suggested datasets, is located here: https://spring2023.data606.net/assignments/project/
200 randomly selected students completed the reading and writing test of the High School and Beyond survey. The results appear to the right. Does there appear to be a difference?
data(hsb2) # in openintro package
hsb2.melt <- melt(hsb2[,c('id','read', 'write')], id='id')
ggplot(hsb2.melt, aes(x=variable, y=value)) + geom_boxplot() + geom_point(alpha=0.2, color='blue') + xlab('Test') + ylab('Score')
head(hsb2)
## # A tibble: 6 × 11
##      id gender race  ses    schtyp prog        read write  math science socst
##   <int> <chr>  <chr> <fct>  <fct>  <fct>      <int> <int> <int>   <int> <int>
## 1    70 male   white low    public general       57    52    41      47    57
## 2   121 female white middle public vocational    68    59    53      63    61
## 3    86 male   white high   public general       44    33    54      58    31
## 4   141 male   white high   public vocational    63    44    47      53    56
## 5   172 male   white middle public academic      47    52    57      53    61
## 6   113 male   white middle public academic      44    52    51      63    61
Are the reading and writing scores of each student independent of each other?
hsb2$diff <- hsb2$read - hsb2$write
head(hsb2$diff)
## [1] 5 9 11 19 -5 -8
ggplot(hsb2, aes(x = diff)) + geom_histogram(aes(y = ..density..), bins = 15, color = 1, fill = 'white') + geom_density(size = 2)
What are the hypotheses for testing if there is a difference between the average reading and writing scores?
H0: There is no difference between the average reading and writing scores.
$\mu_{diff} = 0$
HA: There is a difference between the average reading and writing scores.
$\mu_{diff} \ne 0$
The analysis is no different than what we have done before.
We have data from one sample: differences.
We are testing to see if the average difference is different from 0.
The observed average difference between the two scores is -0.545 points and the standard deviation of the difference is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams (use α=0.05)?
$Z = \frac{-0.545 - 0}{8.887 / \sqrt{200}} = \frac{-0.545}{0.628} = -0.87$
$p\text{-value} = 0.1922 \times 2 = 0.3844$
Since p-value > 0.05, we fail to reject the null hypothesis. That is, the data do not provide evidence that there is a statistically significant difference between the average reading and writing scores.
2 * pnorm(mean(hsb2$diff), mean=0, sd=sd(hsb2$diff)/sqrt(nrow(hsb2)))
## [1] 0.3857741
The probability of obtaining a random sample of 200 students where the average difference between the reading and writing scores is at least 0.545 (in either direction), if in fact the true average difference between the scores is 0, is about 39%.
$-0.545 \pm 1.96 \times \frac{8.887}{\sqrt{200}} = -0.545 \pm 1.96 \times 0.628 = (-1.775, 0.685)$
Note that the confidence interval spans zero!
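The interval above can be reproduced from the summary statistics alone. This is a minimal sketch using the rounded values reported in the text; the variable names (`mean_diff`, `sd_diff`, `n`) are illustrative and not part of the original slides.

```r
# Summary statistics reported above (rounded values)
mean_diff <- -0.545   # observed mean of read - write
sd_diff   <- 8.887    # standard deviation of the differences
n         <- 200      # number of students

se     <- sd_diff / sqrt(n)   # standard error of the mean difference
z_star <- qnorm(0.975)        # critical value for 95% confidence (about 1.96)
ci     <- mean_diff + c(-1, 1) * z_star * se
ci

# TRUE when the null value 0 lies inside the interval,
# which agrees with failing to reject H0 at alpha = 0.05
ci[1] < 0 & 0 < ci[2]
```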
library(granovaGG)
granovagg.ds(as.data.frame(hsb2[,c('read', 'write')]))
data(sat)
head(sat)
##   Verbal.SAT Math.SAT Sex
## 1        450      450   F
## 2        640      540   F
## 3        590      570   M
## 4        400      400   M
## 5        600      590   M
## 6        610      610   M
Is there a difference in math scores between males and females?
tab <- describeBy(sat$Math.SAT, group=sat$Sex, mat=TRUE, skew=FALSE)
tab[,c(2,4:7)]
##     group1  n     mean        sd min
## X11      F 82 597.6829 103.70065 360
## X12      M 80 626.8750  90.35225 390
ggplot(sat, aes(x=Sex, y=Math.SAT)) + geom_boxplot() + geom_point(data = tab, aes(x=group1, y=mean), color='blue', size=4)
ggplot(sat, aes(x=Math.SAT, color = Sex)) + geom_density()
We wish to calculate a 95% confidence interval for the average difference between SAT scores for males and females.
Assumptions:
Independence within groups.
Independence between groups.
Sample size/skew
Standard error for difference in SAT scores
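$SE_{(\bar{x}_M - \bar{x}_F)} = \sqrt{\frac{s_M^2}{n_M} + \frac{s_F^2}{n_F}}$
$SE_{(\bar{x}_M - \bar{x}_F)} = \sqrt{\frac{90.4^2}{80} + \frac{103.7^2}{82}} = 15.27$
Calculate the 95% confidence interval:
$(\bar{x}_M - \bar{x}_F) \pm 1.96 \times SE_{(\bar{x}_M - \bar{x}_F)}$
$(626.9 - 597.7) \pm 1.96 \times 15.27$
$29.2 \pm 29.93 = (-0.73, 59.13)$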
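As a check on the arithmetic, the standard error and interval can be computed directly from the group summaries printed by `describeBy()`. This is a sketch under the assumption that those printed values are used as-is; the variable names are illustrative.

```r
# Group summaries from the describeBy() output above
n_M <- 80; mean_M <- 626.8750; sd_M <- 90.35225
n_F <- 82; mean_F <- 597.6829; sd_F <- 103.70065

# Standard error of the difference: each variance (sd^2) divided by its sample size
se_diff <- sqrt(sd_M^2 / n_M + sd_F^2 / n_F)

# 95% confidence interval using the normal critical value 1.96
diff_means <- mean_M - mean_F
ci <- diff_means + c(-1, 1) * 1.96 * se_diff

se_diff
ci
```

Note that the standard deviations must be squared before dividing by the sample sizes; dividing the unsquared standard deviations gives a standard error that is far too small.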
What if you want to compare the quality of one batch of Guinness beer to the next?
The confidence interval is estimated using
$\bar{x} \pm t^*_{df} \times SE$
where df is the degrees of freedom (df = n - 1).
The `pt` and `qt` functions will give you the cumulative probability (used to find p-values) and the critical value from the t-distribution, respectively.
Critical value for a two-sided test at α = 0.05 (0.025 in each tail), degrees of freedom = 10
qt(0.025, df = 10)
## [1] -2.228139
Cumulative probability for a t-statistic of 2, degrees of freedom = 10 (the upper-tail p-value is 1 minus this value)
pt(2, df=10)
## [1] 0.963306
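Because `pt` returns the area to the left of a t value, a small conversion is needed to get a p-value. A minimal sketch, with illustrative variable names, of how the two functions are typically combined:

```r
df <- 10

# Two-sided critical value for alpha = 0.05: the upper 2.5% point.
# By symmetry this is the positive counterpart of qt(0.025, df = 10).
t_crit <- qt(0.975, df = df)

# One-sided p-value for an observed t of 2: area to the right of 2
p_upper <- pt(2, df = df, lower.tail = FALSE)

# Two-sided p-value: double the tail area
p_two_sided <- 2 * p_upper

t_crit
p_upper
p_two_sided
```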
The `t.test` function will calculate a null hypothesis test using the t-distribution.
t.test(Math.SAT ~ Sex, data = sat)
## 
## 	Welch Two Sample t-test
## 
## data:  Math.SAT by Sex
## t = -1.9117, df = 158.01, p-value = 0.05773
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -59.3527145   0.9685682
## sample estimates:
## mean in group F mean in group M 
##        597.6829        626.8750 
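The t statistic, degrees of freedom, and p-value in this output can be approximated from the group summaries shown earlier. The sketch below assumes the Welch-Satterthwaite formula for the degrees of freedom, which is what a Welch two-sample test uses; the variable names are illustrative.

```r
# Group summaries from the describeBy() output
n_F <- 82; mean_F <- 597.6829; sd_F <- 103.70065
n_M <- 80; mean_M <- 626.8750; sd_M <- 90.35225

# Per-group variance of the sample mean
v_F <- sd_F^2 / n_F
v_M <- sd_M^2 / n_M

# Welch t statistic (F minus M, matching the order used by t.test)
t_stat <- (mean_F - mean_M) / sqrt(v_F + v_M)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_F + v_M)^2 / (v_F^2 / (n_F - 1) + v_M^2 / (n_M - 1))

# Two-sided p-value from the t-distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```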
The goal of ANOVA is to test whether there is a discernible difference between the means of several groups.
Hand Washing Example
Is there a difference between washing hands with: water only, regular soap, antibacterial soap (ABS), and antibacterial spray (AS)?
For ANOVA:
Source: De Veaux, R.D., Velleman, P.F., & Bock, D.E. (2014). Intro Stats, 4th Ed. Pearson.
ggplot(hand_washing, aes(x = Method, y = Bacterial_Counts)) + geom_boxplot() + geom_beeswarm(aes(color = Method)) + theme(legend.position = 'none')
desc <- psych::describeBy(hand_washing$Bacterial_Counts, group = hand_washing$Method, mat = TRUE, skew = FALSE)
names(desc)[2] <- 'Method' # Rename the grouping column
desc$Var <- desc$sd^2 # We will need the variance later, so calculate it here
desc
##     item             Method vars n  mean       sd min max range        se       Var
## X11    1      Alcohol Spray    1 8  37.5 26.55991   5  82    77  9.390345  705.4286
## X12    2 Antibacterial Soap    1 8  92.5 41.96257  20 164   144 14.836008 1760.8571
## X13    3               Soap    1 8 106.0 46.95895  51 207   156 16.602496 2205.1429
## X14    4              Water    1 8 117.0 31.13106  74 170    96 11.006492  969.1429
( k <- length(unique(hand_washing$Method)) )
## [1] 4
( n <- nrow(hand_washing) )
## [1] 32
( grand_mean <- mean(hand_washing$Bacterial_Counts) )
## [1] 88.25
( grand_var <- var(hand_washing$Bacterial_Counts) )
## [1] 2237.613
( pooled_var <- mean(desc$Var) )
## [1] 1410.143
A contrast is a linear combination of two or more factor level means with coefficients that sum to zero.
desc$contrast <- (desc$mean - mean(desc$mean))
mean(desc$contrast) # Should be 0!
## [1] 0
desc
##     item             Method vars n  mean       sd min max range        se       Var contrast
## X11    1      Alcohol Spray    1 8  37.5 26.55991   5  82    77  9.390345  705.4286   -50.75
## X12    2 Antibacterial Soap    1 8  92.5 41.96257  20 164   144 14.836008 1760.8571     4.25
## X13    3               Soap    1 8 106.0 46.95895  51 207   156 16.602496 2205.1429    17.75
## X14    4              Water    1 8 117.0 31.13106  74 170    96 11.006492  969.1429    28.75
$SS_{within} = \sum_k \sum_i (x_{ik} - \bar{x}_k)^2$
$SS_{between} = \sum_k n_k (\bar{x}_k - \bar{x})^2$
Source | Sum of Squares | df | MS |
---|---|---|---|
Between Group (Treatment) | $\sum_k n_k (\bar{x}_k - \bar{x})^2$ | k - 1 | $\frac{SS_{between}}{df_{between}}$ |
Within Group (Error) | $\sum_k \sum_i (x_{ik} - \bar{x}_k)^2$ | n - k | $\frac{SS_{within}}{df_{within}}$ |
Total | $\sum_k \sum_i (x_{ik} - \bar{x})^2$ | n - 1 | |
Mean squares can be represented as squares, hence the ratio of the areas of the two rectangles is equal to $\frac{MS_{Between}}{MS_{Within}}$, which is the F-statistic.
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
Variance components we need to evaluate the null hypothesis:
Between Sum of Squares: $SS_{between} = \sum_k n_k (\bar{x}_k - \bar{x})^2$
Within Sum of Squares: $SS_{within} = \sum_k \sum_i (x_{ik} - \bar{x}_k)^2$
Between degrees of freedom: $df_{between} = k - 1$ (k = number of groups)
Within degrees of freedom: $df_{within} = n - k$ (n = total number of observations)
Mean square between (aka treatment): $MST = \frac{SS_{between}}{df_{between}}$
Mean square within (aka error): $MSE = \frac{SS_{within}}{df_{within}}$
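The components listed above can be computed from the group summaries alone. The following is a sketch that hard-codes the group sizes, means, and variances from the `describeBy()` table for the hand washing data; the vector names are illustrative.

```r
# Group summaries taken from the describeBy() table
# (Alcohol Spray, Antibacterial Soap, Soap, Water)
group_n    <- c(8, 8, 8, 8)
group_mean <- c(37.5, 92.5, 106.0, 117.0)
group_var  <- c(705.4286, 1760.8571, 2205.1429, 969.1429)

k <- length(group_mean)                          # number of groups
n <- sum(group_n)                                # total number of observations
grand_mean <- sum(group_n * group_mean) / n

# Between: group size times squared distance of each group mean from the grand mean
ss_between <- sum(group_n * (group_mean - grand_mean)^2)

# Within: (n_k - 1) * s_k^2 recovers each group's sum of squared deviations
ss_within <- sum((group_n - 1) * group_var)

df_between <- k - 1
df_within  <- n - k

mst <- ss_between / df_between   # mean square between (treatment)
mse <- ss_within / df_within     # mean square within (error)

c(SSB = ss_between, SSW = ss_within, MST = mst, MSE = mse)
```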
Assume each washing method has the same variance.
Then we can pool them all together to get the pooled variance $s_p^2$
Since the sample sizes are all equal, we can average the four variances: $s_p^2 = 1410.14$
mean(desc$Var)
## [1] 1410.143
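The simple average works here only because every group has the same size. A sketch of the more general weighted form, which weights each variance by its degrees of freedom, is shown below; the vector names are illustrative and the values are copied from the `describeBy()` table.

```r
group_n   <- c(8, 8, 8, 8)
group_var <- c(705.4286, 1760.8571, 2205.1429, 969.1429)

# Pooled variance: weight each group variance by its degrees of freedom (n_k - 1)
pooled_var <- sum((group_n - 1) * group_var) / sum(group_n - 1)

pooled_var
# With equal group sizes the weighted form reduces to the plain average
all.equal(pooled_var, mean(group_var))
```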
MST
MSE
Comparing
How do we tell whether $\frac{MST}{MSE}$ is large enough to not be due just to random chance?
$\frac{MST}{MSE}$ follows the F-Distribution
$F = \frac{MST}{MSE}$ is called the F-Statistic.
A Shiny App by Dr. Dudek to explore the F-Distribution: https://shiny.rit.albany.edu/stat/fdist/
df.numerator <- 4 - 1
df.denominator <- 4 * (8 - 1)
DATA606::F_plot(df.numerator, df.denominator, cv = qf(0.95, df.numerator, df.denominator))
Source | Sum of Squares | df | MS | F | p |
---|---|---|---|---|---|
Between Group (Treatment) | $\sum_k n_k (\bar{x}_k - \bar{x})^2$ | k - 1 | $\frac{SS_{between}}{df_{between}}$ | $\frac{MS_{between}}{MS_{within}}$ | area to right of $F_{k-1, n-k}$ |
Within Group (Error) | $\sum_k \sum_i (x_{ik} - \bar{x}_k)^2$ | n - k | $\frac{SS_{within}}{df_{within}}$ | | |
Total | $\sum_k \sum_i (x_{ik} - \bar{x})^2$ | n - 1 | | | |
aov(Bacterial_Counts ~ Method, data = hand_washing) |> summary()
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Method       3  29882    9961   7.064 0.00111 **
## Residuals   28  39484    1410                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
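The F value and p-value in this output follow from the sums of squares and degrees of freedom in the first columns. A minimal sketch that plugs in the rounded sums of squares printed above (the variable names are illustrative):

```r
# Sums of squares and degrees of freedom from the aov() summary
ss_between <- 29882; df_between <- 3
ss_within  <- 39484; df_within  <- 28

mst <- ss_between / df_between   # mean square for Method
mse <- ss_within / df_within     # mean square for Residuals

f_stat <- mst / mse

# p-value: area to the right of the observed F under the F(3, 28) distribution
p_value <- pf(f_stat, df1 = df_between, df2 = df_within, lower.tail = FALSE)

# Critical value for alpha = 0.05, for comparison with the observed F
f_crit <- qf(0.95, df1 = df_between, df2 = df_within)

c(F = f_stat, p = p_value, critical = f_crit)
```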
To check the assumptions and conditions for ANOVA, always look at the side-by-side boxplots.
Independence Assumption
Equal Variance Assumption
ANOVA Vignette in the `VisualStats` package: https://jbryer.github.io/VisualStats/articles/anova.html
The plots were created using the `VisualStats::anova_vis()` function.
Shiny app:
remotes::install_github('jbryer/VisualStats')
library(VisualStats)
library(ShinyDemo)
shiny_demo('anova', package = 'VisualStats')
# Unadjusted critical value; qt(0.05, ...) is a lower-tail value, so cv is negative
# and the error bars still span mean +/- |cv| * se
cv <- qt(0.05, df = 15)
tab <- describeBy(hand_washing$Bacterial_Counts, group = hand_washing$Method, mat = TRUE)
ggplot(hand_washing, aes(x = Method, y = Bacterial_Counts)) +
  geom_boxplot() +
  geom_errorbar(data = tab, aes(x = group1, y = mean, ymin = mean - cv * se, ymax = mean + cv * se),
                color = 'darkgreen', width = 0.5, size = 1) +
  geom_point(data = tab, aes(x = group1, y = mean), color = 'blue', size = 3)
# Critical value with alpha divided by 3 comparisons (wider error bars)
cv <- qt(0.05 / 3, df = 15)
tab <- describeBy(hand_washing$Bacterial_Counts, group = hand_washing$Method, mat = TRUE)
ggplot(hand_washing, aes(x = Method, y = Bacterial_Counts)) +
  geom_boxplot() +
  geom_errorbar(data = tab, aes(x = group1, y = mean, ymin = mean - cv * se, ymax = mean + cv * se),
                color = 'darkgreen', width = 0.5, size = 1) +
  geom_point(data = tab, aes(x = group1, y = mean), color = 'blue', size = 3)
# Critical value with alpha divided by all choose(4, 2) = 6 pairwise comparisons
cv <- qt(0.05 / choose(4, 2), df = 15)
tab <- describeBy(hand_washing$Bacterial_Counts, group = hand_washing$Method, mat = TRUE)
ggplot(hand_washing, aes(x = Method, y = Bacterial_Counts)) +
  geom_boxplot() +
  geom_errorbar(data = tab, aes(x = group1, y = mean, ymin = mean - cv * se, ymax = mean + cv * se),
                color = 'darkgreen', width = 0.5, size = 1) +
  geom_point(data = tab, aes(x = group1, y = mean), color = 'blue', size = 3)
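The three plots above differ only in how the significance level is split across comparisons. The sketch below compares the resulting critical values, written as positive upper-tail values for readability; it assumes the same `df = 15` used in the plotting code, and the variable names are illustrative.

```r
alpha <- 0.05
df    <- 15            # degrees of freedom used in the plotting code above
m     <- choose(4, 2)  # number of pairwise comparisons among 4 methods

# Upper-tail critical values; by symmetry these equal -qt(alpha, df), etc.
cv_unadjusted <- qt(1 - alpha, df = df)
cv_three      <- qt(1 - alpha / 3, df = df)
cv_pairwise   <- qt(1 - alpha / m, df = df)

# Dividing alpha by more comparisons pushes the critical value outward,
# which widens the error bars around each group mean
c(unadjusted = cv_unadjusted, three = cv_three, pairwise = cv_pairwise)
```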
Complete the one minute paper: https://forms.gle/p9xcKcTbGiyYSz368