class: center, middle, inverse, title-slide

# Summarizing Data Part 1
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### February 8, 2023

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

.pull-right[
**What important question remains unanswered for you?**

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# Workflow

.center[
<img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' />
]

.font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)]

---
# Tidy Data

.center[
<img src='images/tidydata_1.jpg' height='500' />
]

See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html).

---
# Types of Data

.pull-left[
* Numerical (quantitative)
    * Continuous
    * Discrete
]

.pull-right[
* Categorical (qualitative)
    * Regular categorical
    * Ordinal
]

.center[
<img src='images/continuous_discrete.png' height='400' />
]

---
# Data Types in R

<img src="images/DataTypesConceptModel.png" width="1000" style="display: block; margin: auto;" />

---
# Data Types / Descriptives / Visualizations

Data Type                  | Descriptive Stats                             | Visualization
---------------------------|-----------------------------------------------|-------------------
Continuous                 | mean, median, mode, standard deviation, IQR   | histogram, density, box plot
Discrete                   | contingency table, proportional table, median | bar plot
Categorical                | contingency table, proportional table         | bar plot
Ordinal                    | contingency table, proportional table, median | bar plot
Two quantitative           | correlation                                   | scatter plot
Two qualitative            | contingency table, chi-squared                | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test              | box plot

---
# Variance

.pull-left[
Population Variance:

$$
S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Consider a dataset with five values (black points in the figure). For the largest value, the deviation from the mean is represented by the blue line ( `\(x_i - \bar{x}\)` ).

See also:
https://shiny.rit.albany.edu/stat/visualizess/
https://github.com/jbryer/VisualStats/
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

In the numerator, we square each of these deviations. We can conceptualize this as a square. Here, we add the deviation in the *y* direction.
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We end up with a square.
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We can plot the squared deviation for all the data points. That is, each term in the numerator is the area of one of these squares.
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

The variance is therefore the average of the area of all these squares, here represented by the orange square.
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---
# Population versus Sample Variance

.pull-left[
Typically we want the sample variance. The difference is that we divide by `\(n - 1\)` to calculate the sample variance.
This results in a slightly larger area (variance) than if we divide by `\(n\)`.

Population Variance (yellow):

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Sample Variance (green):

$$ s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$$
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

Consider the following data randomly selected from the normal distribution:

.pull-left[
```r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
```

```
## [1] 103.1934
```

```
## [1] 16.8945
```

```r
median(x); IQR(x)
```

```
## [1] 103.9947
```

```
## [1] 25.68004
```
]

.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---
# Robust Statistics

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

Let's add an extreme value:

```r
x <- c(x, 1000)
```

--

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
# Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

* for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
* for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

---
class: font80

# About `legosets` <img src="images/hex/brickset.png" class="title-hex">

To install the `brickset` package:

```r
remotes::install_github('jbryer/brickset')
```

To load the `legosets` dataset:

```r
data('legosets', package = 'brickset')
```

The `legosets` data has 16355 observations of 34 variables.
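A quick way to confirm these numbers yourself is sketched below. It assumes the `brickset` package is installed. The `requireNamespace()` guard and the `lego_dim` name are illustrative additions, not part of the original slides, and the counts may differ from those quoted here if the package data has since been updated.

```r
# Illustrative check: only runs if the brickset package is available
lego_dim <- NULL
if (requireNamespace("brickset", quietly = TRUE)) {
  data("legosets", package = "brickset")
  lego_dim <- dim(legosets)  # c(number of rows, number of columns)
  print(lego_dim)
}
```

`nrow(legosets)` and `ncol(legosets)` return the same two values individually.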
.code70[
```r
names(legosets)
```

```
##  [1] "setID"                 "name"                  "year"
##  [4] "theme"                 "themeGroup"            "subtheme"
##  [7] "category"              "released"              "pieces"
## [10] "minifigs"              "bricksetURL"           "rating"
## [13] "reviewCount"           "packagingType"         "availability"
## [16] "agerange_min"          "US_retailPrice"        "US_dateFirstAvailable"
## [19] "US_dateLastAvailable"  "UK_retailPrice"        "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable"  "CA_retailPrice"        "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable"  "DE_retailPrice"        "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable"  "height"                "width"
## [31] "depth"                 "weight"                "thumbnailURL"
## [34] "imageURL"
```
]

---
# Structure (`str`) <img src="images/hex/brickset.png" class="title-hex">

.code50[
```r
str(legosets)
```

```
## 'data.frame': 16355 obs. of 34 variables:
## $ setID                : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
## $ name                 : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ...
## $ year                 : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
## $ theme                : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
## $ themeGroup           : chr "Vintage" "Vintage" "Vintage" "Vintage" ...
## $ subtheme             : chr NA NA NA NA ...
## $ category             : chr "Normal" "Normal" "Normal" "Normal" ...
## $ released             : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ pieces               : int 67 109 158 233 NA 1 1 60 65 NA ...
## $ minifigs             : int NA NA NA NA NA NA NA NA NA NA ...
## $ bricksetURL          : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
## $ rating               : num 0 0 0 0 0 0 0 0 0 0 ...
## $ reviewCount          : int 0 0 1 0 0 0 0 1 0 0 ...
## $ packagingType        : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ availability         : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ agerange_min         : int NA NA NA NA NA NA NA NA NA NA ...
## $ US_retailPrice       : num NA NA NA NA NA 1.99 NA NA 4.99 NA ...
## $ US_dateFirstAvailable: Date, format: NA NA ...
## $ US_dateLastAvailable : Date, format: NA NA ...
## $ UK_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ UK_dateFirstAvailable: Date, format: NA NA ...
## $ UK_dateLastAvailable : Date, format: NA NA ...
## $ CA_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ CA_dateFirstAvailable: Date, format: NA NA ...
## $ CA_dateLastAvailable : Date, format: NA NA ...
## $ DE_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ DE_dateFirstAvailable: Date, format: NA NA ...
## $ DE_dateLastAvailable : Date, format: NA NA ...
## $ height               : num NA NA NA NA NA ...
## $ width                : num NA NA NA NA NA ...
## $ depth                : num NA NA NA NA NA NA NA NA 5.08 NA ...
## $ weight               : num NA NA NA NA NA NA NA NA NA NA ...
## $ thumbnailURL         : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
## $ imageURL             : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
```
]

---
# RStudio Environment tab can help <img src="images/hex/rstudio.png" class="title-hex">

<img src="images/legosets_rstudio_environment.png" width="500" style="display: block; margin: auto;" />

---
class: hide-logo

# Table View

.font60[
]

---
# Data Wrangling Cheat Sheet <img src="images/hex/dplyr.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf' target='_new'><img src='images/data-transformation.png' width='700' /></a>
]

---
# Tidyverse vs Base R <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/pipe.png" class="title-hex">

.center[
<a href='images/R_Syntax_Comparison.jpeg' target='_new'><img src="images/R_Syntax_Comparison.jpeg" width='700' /></a>
]

---
# Pipes `%>%` and `|>` <img src="images/hex/magrittr.png" class="title-hex">

<img src='images/magrittr_pipe.jpg' align='right' width='200' />

.font90[
The pipe operator (`%>%`), introduced with the `magrittr` R package, allows for the chaining of R operations. Base R (version 4.1.0 and later) has its own pipe operator (`|>`). Both take the output from the left-hand side and pass it as the first argument to the function on the right-hand side.
]

.pull-left[
You can do this in two steps:

```r
tab_out <- table(legosets$category)
prop.table(tab_out)
```

Or as nested function calls:
```r
prop.table(table(legosets$category))
```
]

.pull-right[
Using the pipe (`|>`) operator we can chain these calls in what is arguably a more readable format:

```r
table(legosets$category) |> prop.table()
```
]

<hr />

```
##
##        Book  Collection    Extended        Gear      Normal       Other
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749
##      Random
## 0.002873739
```

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_filter_sm.png' width='800' />
]

---
# Logical Operators

* `!a` - TRUE if a is FALSE
* `a == b` - TRUE if a and b are equal
* `a != b` - TRUE if a and b are not equal
* `a > b` - TRUE if a is larger than b, but not equal
* `a >= b` - TRUE if a is larger than or equal to b
* `a < b` - TRUE if a is smaller than b, but not equal
* `a <= b` - TRUE if a is smaller than or equal to b
* `a %in% b` - TRUE if a is in b where b is a vector

```r
which( letters %in% c('a','e','i','o','u') )
```

```
## [1]  1  5  9 15 21
```

* `a | b` - TRUE if a *or* b is TRUE
* `a & b` - TRUE if a *and* b are TRUE
* `isTRUE(a)` - TRUE if a is TRUE

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
library(dplyr) # provides filter(), select(), mutate(), etc.
mylego <- legosets %>%
  filter(themeGroup == 'Educational' & year > 2015)
```

### Base R

```r
mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]
```

<hr />

```r
nrow(mylego)
```

```
## [1] 61
```

---
# Select <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego <- mylego %>%
  select(setID, pieces, theme, availability, US_retailPrice, minifigs)
```

### Base R

```r
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
```

<hr />

```r
head(mylego, n = 4)
```

```
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}             NA        6
## 2 26689    142 Education {Not specified}             NA        4
## 3 26804     98 Education {Not specified}             NA        6
## 4 26277    188 Education     Educational          78.95       NA
```

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_relocate.png' width='800' />
]

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>%
  relocate(where(is.numeric), .after = where(is.character)) %>%
  head(n = 3)
```

```
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
```

### Base R

```r
mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
```

```
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
```

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/rename_sm.jpg' width='1000' />
]

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>%
  dplyr::rename(USD = US_retailPrice) %>%
  head(n = 3)
```

```
##   setID pieces     theme    availability USD minifigs
## 1 26803    103 Education {Not specified}  NA        6
## 2 26689    142 Education {Not specified}  NA        4
## 3 26804     98 Education {Not specified}  NA        6
```

### Base R

```r
names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
```

```
##       theme    availability setID pieces USD minifigs
## 1 Education {Not specified} 26803    103  NA        6
## 2 Education {Not specified} 26689    142  NA        4
## 3 Education {Not specified} 26804     98  NA        6
```

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img
src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_mutate.png' width='700' />
]

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>%
  filter(!is.na(pieces) & !is.na(US_retailPrice)) %>%
  mutate(Price_per_piece = US_retailPrice / pieces) %>%
  head(n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
```

### Base R

```r
# Keep rows where both the piece count and the price are present
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
head(mylego2, n = 3)
```

---
# Group By and Summarize <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.code80[
```r
legosets %>%
  group_by(themeGroup) %>%
  summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
            sd_price = sd(US_retailPrice, na.rm = TRUE),
            median_price = median(US_retailPrice, na.rm = TRUE),
            n = n(),
            missing = sum(is.na(US_retailPrice)))
```

```
## # A tibble: 15 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure      31.3     29.9         20.0   1280     462
##  2 Basic                 13.1     12.8          7.99   843     473
##  3 Constraction          15.1     14.0          9.99   501     125
##  4 Educational           89.0    107.          59.7    452     294
##  5 Girls                 23.4     22.6         15.0    677     225
##  6 Historical            25.5     27.7         15.0    473     125
##  7 Junior                18.6     13.2         17.8    228      93
##  8 Licensed              42.9     58.3         25.0   2060     467
##  9 Miscellaneous         14.3     20.8          6.99  4925    2117
## 10 Model making          52.8     65.1         30.0    582     166
## 11 Modern day            31.2     33.7         20.0   1723     763
## 12 Pre-school            23.8     19.4         20.0   1487     699
## 13 Racing                24.8     30.2         10      270      59
## 14 Technical             60.8     68.1         40.0    550     137
## 15 Vintage                9.71     9.56         7.50   304     264
```
]

---
# Describe and Describe By

```r
library(psych)
describe(legosets$US_retailPrice)
```

```
##    vars    n  mean sd median trimmed   mad min    max  range skew kurtosis   se
## X1    1 9886 28.52 42  14.99   20.14 14.83   0 799.99 799.99 5.62    58.91 0.42
```

```r
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
```

```
##      item                group1 vars    n      mean        sd   min    max  range         se
## X11     1       {Not specified}    1 3197  24.24484 36.282072  0.60 789.99 789.39  0.6416833
## X12     2           Educational    1    9 140.95000 86.358265 14.95 244.95 230.00 28.7860885
## X13     3        LEGO exclusive    1 1066  28.79797 70.954538  0.00 799.99 799.99  2.1732094
## X14     4    LEGOLAND exclusive    1    7  12.70429  6.447591  4.99  19.99  15.00  2.4369603
## X15     5              Not sold    1    1  12.99000        NA 12.99  12.99   0.00         NA
## X16     6           Promotional    1  167   9.19485 23.667555  0.00 249.99 249.99  1.8314504
## X17     7 Promotional (Airline)    1   11  15.79455  6.614819  5.00  28.00  23.00  1.9944429
## X18     8                Retail    1 4824  29.82030 33.270049  1.95 399.99 398.04  0.4790158
## X19     9      Retail - limited    1  600  44.64837 57.391438  0.40 379.99 379.59  2.3429956
## X110   10               Unknown    1    4   2.24750  1.253671  1.00   3.99   2.99  0.6268356
```

---
# Additional Resources

For data wrangling:

* `dplyr` website: https://dplyr.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
* Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
* Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

---
# One Minute Paper

Complete the one minute paper:
https://forms.gle/p9xcKcTbGiyYSz368

1. What was the most important thing you learned during this class?
2. What important question remains unanswered for you?