What was the most important thing you learned during this class?
What important question remains unanswered for you?
Data Type | Descriptive Stats | Visualization |
---|---|---|
Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot |
Discrete | contingency table, proportional table, median | bar plot |
Categorical | contingency table, proportional table | bar plot |
Ordinal | contingency table, proportional table, median | bar plot |
Two quantitative | correlation | scatter plot |
Two qualitative | contingency table, chi-squared | mosaic plot, bar plot |
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot |
Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
Consider a dataset with five values (black points in the figure). For the largest value, the deviation from the mean, $x_i - \bar{x}$, is represented by the blue line.
See also:
https://shiny.rit.albany.edu/stat/visualizess/
https://github.com/jbryer/VisualStats/
In the numerator of the variance formula, we square each of these deviations. We can picture this literally as a square: extending the same deviation in the y direction gives a square with side length $x_i - \bar{x}$ and area $(x_i - \bar{x})^2$.
We can plot the squared deviation for every data point. Each term in the numerator is the area of one of these squares.
The population variance is therefore the average area of all these squares, represented here by the orange square.
Typically we want the sample variance. The difference is that we divide by $n - 1$ rather than $N$. This results in a slightly larger area (variance) than if we divide by $N$.
Population Variance (yellow): $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
Sample Variance (green): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
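To make the two formulas concrete, here is a minimal sketch in R. The values in `x` are hypothetical (they are not the five points from the figure). Note that R's built-in `var()` returns the sample variance, so the population variance has to be computed by hand.

```r
# Hypothetical data, chosen so that the mean is a whole number (5)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Squared deviations from the mean: the area of each square
sq_dev <- (x - mean(x))^2

# Population variance: the average area (divide by N)
pop_var <- sum(sq_dev) / length(x)

# Sample variance: divide by n - 1 instead
samp_var <- sum(sq_dev) / (length(x) - 1)

pop_var                       # 4
samp_var                      # 4.571429
all.equal(samp_var, var(x))   # TRUE
```

Dividing by n - 1 instead of N always gives the larger of the two values, which matches the larger green square.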
Consider the following data randomly selected from the normal distribution:
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
## [1] 103.1934
## [1] 16.8945
median(x); IQR(x)
## [1] 103.9947
## [1] 25.68004
Let's add an extreme value:
x <- c(x, 1000)
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore:
- for skewed distributions, it is often more helpful to use the median and IQR to describe the center and spread
- for symmetric distributions, it is often more helpful to use the mean and SD to describe the center and spread
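The effect of a single extreme value can be sketched with a small, fixed vector. The numbers below are hypothetical and chosen for easy arithmetic; they are not the simulated data used above.

```r
# Hypothetical symmetric data centered on 100
x <- c(85, 90, 95, 100, 105, 110, 115)

# The same data with one extreme value appended
x_out <- c(x, 1000)

# Mean and SD are pulled strongly toward the extreme value
mean_before <- mean(x)        # 100
mean_after  <- mean(x_out)    # 212.5
sd_before   <- sd(x)
sd_after    <- sd(x_out)

# Median and IQR barely move
median_before <- median(x)      # 100
median_after  <- median(x_out)  # 102.5
iqr_before    <- IQR(x)         # 15
iqr_after     <- IQR(x_out)     # 17.5
```

In this sketch the mean more than doubles while the median shifts by 2.5, which is the sense in which the median and IQR are robust.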
legosets
To install the brickset package:
remotes::install_github('jbryer/brickset')
To load the legosets dataset:
data('legosets', package = 'brickset')
The legosets data has 16355 observations of 34 variables.
names(legosets)
##  [1] "setID"                 "name"                  "year"
##  [4] "theme"                 "themeGroup"            "subtheme"
##  [7] "category"              "released"              "pieces"
## [10] "minifigs"              "bricksetURL"           "rating"
## [13] "reviewCount"           "packagingType"         "availability"
## [16] "agerange_min"          "US_retailPrice"        "US_dateFirstAvailable"
## [19] "US_dateLastAvailable"  "UK_retailPrice"        "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable"  "CA_retailPrice"        "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable"  "DE_retailPrice"        "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable"  "height"                "width"
## [31] "depth"                 "weight"                "thumbnailURL"
## [34] "imageURL"
str(legosets)
## 'data.frame': 16355 obs. of 34 variables:
##  $ setID                : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
##  $ name                 : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ...
##  $ year                 : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ theme                : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
##  $ themeGroup           : chr "Vintage" "Vintage" "Vintage" "Vintage" ...
##  $ subtheme             : chr NA NA NA NA ...
##  $ category             : chr "Normal" "Normal" "Normal" "Normal" ...
##  $ released             : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ pieces               : int 67 109 158 233 NA 1 1 60 65 NA ...
##  $ minifigs             : int NA NA NA NA NA NA NA NA NA NA ...
##  $ bricksetURL          : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
##  $ rating               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ reviewCount          : int 0 0 1 0 0 0 0 1 0 0 ...
##  $ packagingType        : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ availability         : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ agerange_min         : int NA NA NA NA NA NA NA NA NA NA ...
##  $ US_retailPrice       : num NA NA NA NA NA 1.99 NA NA 4.99 NA ...
##  $ US_dateFirstAvailable: Date, format: NA NA ...
##  $ US_dateLastAvailable : Date, format: NA NA ...
##  $ UK_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
##  $ UK_dateFirstAvailable: Date, format: NA NA ...
##  $ UK_dateLastAvailable : Date, format: NA NA ...
##  $ CA_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
##  $ CA_dateFirstAvailable: Date, format: NA NA ...
##  $ CA_dateLastAvailable : Date, format: NA NA ...
##  $ DE_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
##  $ DE_dateFirstAvailable: Date, format: NA NA ...
##  $ DE_dateLastAvailable : Date, format: NA NA ...
##  $ height               : num NA NA NA NA NA ...
##  $ width                : num NA NA NA NA NA ...
##  $ depth                : num NA NA NA NA NA NA NA NA 5.08 NA ...
##  $ weight               : num NA NA NA NA NA NA NA NA NA NA ...
##  $ thumbnailURL         : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
##  $ imageURL             : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
Row | setID | name | year | theme | themeGroup | category | US_retailPrice | pieces | minifigs | rating |
---|---|---|---|---|---|---|---|---|---|---|
1 | 29512 | Mummy Queen | 2019 | Collectable Minifigures | Miscellaneous | Normal | 3.99 | 6 | 1 | 3.7 |
2 | 7650 | Von Nebula | 2010 | HERO Factory | Constraction | Normal | 19.99 | 156 | 4.3 | |
3 | 9937 | Ninjago Kai ZX Kids' Watch | 2012 | Gear | Miscellaneous | Gear | 24.99 | 0 | ||
4 | 6565 | Aeroplane | 2004 | Creator | Model making | Normal | 0 | |||
5 | 28932 | Princess Leia Key Chain | 2019 | Gear | Miscellaneous | Gear | 4.99 | 0 | ||
6 | 24433 | Dressing table | 2015 | Friends | Girls | Other | 22 | 0 | ||
7 | 2741 | Crane and Digger Accessories | 1998 | Service Packs | Miscellaneous | Normal | 4 | 14 | 0 | |
8 | 23821 | Azari and the Magical Bakery | 2015 | Elves | Action/Adventure | Normal | 29.99 | 324 | 2 | 3.8 |
9 | 22536 | Small Freestyle Bucket | 1996 | Freestyle | Basic | Normal | 0 | |||
10 | 29301 | Lady Liberty | 2019 | BrickHeadz | Licensed | Normal | 9.99 | 153 | 4 |
%>% and |>
The pipe operator (%>%), introduced with the magrittr R package, allows R operations to be chained. Base R has since added its own native pipe operator (|>), available from R 4.1.0. Both take the output of the left-hand side and pass it as the first argument to the function on the right-hand side.
You can do this in two steps:
tab_out <- table(legosets$category)
prop.table(tab_out)
Or as nested function calls.
prop.table(table(legosets$category))
Using the pipe operator (|>), we can chain these calls in what is arguably a more readable format:
table(legosets$category) |> prop.table()
## 
##        Book  Collection    Extended        Gear      Normal       Other 
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749 
##      Random 
## 0.002873739
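The equivalence of the nested and piped forms can be checked without installing any package. The sketch below uses a hypothetical `category` vector standing in for `legosets$category`, and assumes R 4.1.0 or later so that the native pipe is available.

```r
# Hypothetical stand-in for legosets$category
category <- c("Normal", "Gear", "Normal", "Other", "Normal", "Gear")

# Nested calls: read from the inside out
nested <- round(prop.table(table(category)), 2)

# Piped calls: read from left to right; each result becomes the
# first argument of the next function
piped <- table(category) |> prop.table() |> round(2)

# Both forms produce the same table of proportions
identical(nested, piped)   # TRUE
```

The piped version lists the steps in the order they are executed, which is the readability argument made above.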
!a - TRUE if a is FALSE
a == b - TRUE if a and b are equal
a != b - TRUE if a and b are not equal
a > b - TRUE if a is larger than b, but not equal
a >= b - TRUE if a is larger than or equal to b
a < b - TRUE if a is smaller than b, but not equal
a <= b - TRUE if a is smaller than or equal to b
a %in% b - TRUE if a is in b, where b is a vector
which( letters %in% c('a','e','i','o','u') )
## [1]  1  5  9 15 21
a | b - TRUE if a or b is TRUE
a & b - TRUE if a and b are both TRUE
isTRUE(a) - TRUE if a is TRUE
dplyr
library(dplyr)
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)
# Base R equivalent; which() drops rows where the condition is NA, as filter() does
mylego <- legosets[which(legosets$themeGroup == 'Educational' & legosets$year > 2015),]
nrow(mylego)
## [1] 61
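One difference between the two filtering styles is how missing values in the condition are handled. The sketch below uses a hypothetical four-row data frame (`toy`), not the real legosets data, and only base R.

```r
# Hypothetical data: the third row has a missing themeGroup
toy <- data.frame(
  themeGroup = c("Educational", "Educational", NA, "Basic"),
  year       = c(2014, 2018, 2020, 2019)
)

# Comparisons with NA yield NA, so the condition is FALSE TRUE NA FALSE
keep <- toy$themeGroup == "Educational" & toy$year > 2015

# Plain logical indexing keeps an all-NA row for the NA condition
n_plain <- nrow(toy[keep, ])          # 2

# which() returns only the positions that are TRUE
n_which <- nrow(toy[which(keep), ])   # 1

# subset() also drops rows where the condition is NA
n_subset <- nrow(subset(toy, themeGroup == "Educational" & year > 2015))  # 1
```

dplyr's `filter()` likewise keeps only rows where the condition is TRUE, so wrapping the base R condition in `which()` (or using `subset()`) is one way to get matching row counts.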
dplyr
mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)
# Base R equivalent
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
head(mylego, n = 4)
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}             NA        6
## 2 26689    142 Education {Not specified}             NA        4
## 3 26804     98 Education {Not specified}             NA        6
## 4 26277    188 Education     Educational          78.95       NA
dplyr
mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
# Base R equivalent
mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
dplyr
mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)
##   setID pieces     theme    availability USD minifigs
## 1 26803    103 Education {Not specified}  NA        6
## 2 26689    142 Education {Not specified}  NA        4
## 3 26804     98 Education {Not specified}  NA        6
# Base R equivalent
names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
##       theme    availability setID pieces USD minifigs
## 1 Education {Not specified} 26803    103  NA        6
## 2 Education {Not specified} 26689    142  NA        4
## 3 Education {Not specified} 26804     98  NA        6
dplyr
mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>% mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3)
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
# Base R equivalent: keep rows where both pieces and price are known,
# then compute the price per piece
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
# Shows the same three rows as the dplyr version (row names may differ)
head(mylego2, n = 3)
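One pitfall when filtering on missing values in base R is referring to a column that does not exist yet: `$` returns NULL, the condition collapses to a zero-length vector, and the subset silently has zero rows. A minimal sketch with a hypothetical three-row data frame (`toy`):

```r
# Hypothetical data: one complete row, two rows with a missing value
toy <- data.frame(pieces = c(100, NA, 50),
                  US_retailPrice = c(10, 20, NA))

# Price_per_piece has not been created yet, so this is NULL
is.null(toy$Price_per_piece)      # TRUE

# is.na(NULL) is logical(0), and x & logical(0) is also logical(0)
bad_filter <- !is.na(toy$US_retailPrice) & !is.na(toy$Price_per_piece)
length(bad_filter)                # 0
n_bad <- nrow(toy[bad_filter, ])  # 0 rows, with no error or warning

# Filtering on the columns that do exist, then creating the new one
complete <- toy[!is.na(toy$pieces) & !is.na(toy$US_retailPrice), ]
complete$Price_per_piece <- complete$US_retailPrice / complete$pieces
complete$Price_per_piece          # 0.1
```

Checking `nrow()` after each subsetting step is a cheap way to catch this kind of silent failure.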
legosets %>%
  group_by(themeGroup) %>%
  summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
            sd_price = sd(US_retailPrice, na.rm = TRUE),
            median_price = median(US_retailPrice, na.rm = TRUE),
            n = n(),
            missing = sum(is.na(US_retailPrice)))
## # A tibble: 15 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure      31.3     29.9         20.0   1280     462
##  2 Basic                 13.1     12.8          7.99   843     473
##  3 Constraction          15.1     14.0          9.99   501     125
##  4 Educational           89.0    107.          59.7    452     294
##  5 Girls                 23.4     22.6         15.0    677     225
##  6 Historical            25.5     27.7         15.0    473     125
##  7 Junior                18.6     13.2         17.8    228      93
##  8 Licensed              42.9     58.3         25.0   2060     467
##  9 Miscellaneous         14.3     20.8          6.99  4925    2117
## 10 Model making          52.8     65.1         30.0    582     166
## 11 Modern day            31.2     33.7         20.0   1723     763
## 12 Pre-school            23.8     19.4         20.0   1487     699
## 13 Racing                24.8     30.2         10      270      59
## 14 Technical             60.8     68.1         40.0    550     137
## 15 Vintage                9.71     9.56         7.50   304     264
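For comparison with the other dplyr and base R pairs, a grouped summary can also be sketched in base R with `tapply()`. The data frame below (`toy`) is hypothetical and much smaller than legosets; this is one possible approach, not the only one.

```r
# Hypothetical data: two theme groups, one missing price
toy <- data.frame(
  themeGroup     = c("Basic", "Basic", "Racing", "Racing", "Racing"),
  US_retailPrice = c(5, 15, 10, NA, 30)
)

# tapply() applies a function to the prices within each group;
# extra arguments such as na.rm = TRUE are passed on to the function
mean_price <- tapply(toy$US_retailPrice, toy$themeGroup, mean, na.rm = TRUE)
n          <- tapply(toy$US_retailPrice, toy$themeGroup, length)
missing    <- tapply(is.na(toy$US_retailPrice), toy$themeGroup, sum)

# Assemble the per-group results into one data frame
summary_df <- data.frame(
  themeGroup = names(mean_price),
  mean_price = as.vector(mean_price),
  n          = as.vector(n),
  missing    = as.vector(missing)
)
summary_df
```

As in the dplyr version, `n` counts all rows in the group, including those with a missing price, while `missing` counts only the NA values.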
library(psych)
describe(legosets$US_retailPrice)
##    vars    n  mean sd median trimmed   mad min    max  range skew kurtosis   se
## X1    1 9886 28.52 42  14.99   20.14 14.83   0 799.99 799.99 5.62    58.91 0.42
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
##      item                group1 vars    n      mean        sd   min    max  range         se
## X11     1       {Not specified}    1 3197  24.24484 36.282072  0.60 789.99 789.39  0.6416833
## X12     2           Educational    1    9 140.95000 86.358265 14.95 244.95 230.00 28.7860885
## X13     3        LEGO exclusive    1 1066  28.79797 70.954538  0.00 799.99 799.99  2.1732094
## X14     4    LEGOLAND exclusive    1    7  12.70429  6.447591  4.99  19.99  15.00  2.4369603
## X15     5              Not sold    1    1  12.99000        NA 12.99  12.99   0.00         NA
## X16     6           Promotional    1  167   9.19485 23.667555  0.00 249.99 249.99  1.8314504
## X17     7 Promotional (Airline)    1   11  15.79455  6.614819  5.00  28.00  23.00  1.9944429
## X18     8                Retail    1 4824  29.82030 33.270049  1.95 399.99 398.04  0.4790158
## X19     9      Retail - limited    1  600  44.64837 57.391438  0.40 379.99 379.59  2.3429956
## X110   10               Unknown    1    4   2.24750  1.253671  1.00   3.99   2.99  0.6268356
For data wrangling:
dplyr website: https://dplyr.tidyverse.org
Complete the one minute paper: https://forms.gle/p9xcKcTbGiyYSz368
What was the most important thing you learned during this class?
What important question remains unanswered for you?