Summarizing Data Part 1

DATA 606 - Statistics & Probability for Data Analytics

Jason Bryer, Ph.D. and Angela Lui, Ph.D.

February 8, 2023

1 / 38

One Minute Paper Results

What was the most important thing you learned during this class?

What important question remains unanswered for you?

2 / 38

Workflow

Data Science Workflow

Source: Wickham & Grolemund, 2017

3 / 38

Tidy Data

See Wickham (2014) Tidy data.

4 / 38

Types of Data

Numerical (quantitative)
- Continuous
- Discrete

Categorical (qualitative)
- Regular categorical
- Ordinal

5 / 38

Data Types in R
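
As a rough illustration of how the data types on the previous slide map onto R classes, the sketch below builds one small vector per type. The variable names and values are hypothetical and are not taken from the course data.

```r
# Hypothetical toy values, one vector per data type from the "Types of Data" slide.
price    <- c(3.99, 19.99, 24.99)                    # continuous -> numeric (double)
pieces   <- c(6L, 156L, 324L)                        # discrete -> integer
category <- factor(c("Normal", "Gear", "Normal"))    # regular categorical -> factor
size     <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)                   # ordinal -> ordered factor

class(price)
class(pieces)
class(category)
class(size)

# Ordered factors support comparisons against one of their levels.
size < "large"
```

An ordered factor keeps the ranking of its levels, which is what makes a median meaningful for ordinal data but not for regular categorical data.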

6 / 38

Data Types / Descriptives / Visualizations

Data Type	Descriptive Stats	Visualization
Continuous	mean, median, mode, standard deviation, IQR	histogram, density, box plot
Discrete	contingency table, proportional table, median	bar plot
Categorical	contingency table, proportional table	bar plot
Ordinal	contingency table, proportional table, median	bar plot
Two quantitative	correlation	scatter plot
Two qualitative	contingency table, chi-squared	mosaic plot, bar plot
Quantitative & Qualitative	grouped summaries, ANOVA, t-test	box plot

7 / 38

Variance

Population Variance: $S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$

Consider a dataset with five values (black points in the figure). For the largest value, the deviation from the mean ( $x_i - \bar{x}$ ) is represented by the blue line.

8 / 38

Variance (cont.)

Population Variance: $S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$

In the numerator, we square each of these deviations. We can conceptualize each squared deviation as the area of a square: here, we draw the same deviation again in the y direction.

9 / 38

Variance (cont.)

Population Variance: $S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$

We end up with a square.

10 / 38

Variance (cont.)

Population Variance: $S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$

We can plot the squared deviation for every data point. That is, each term in the numerator is the area of one of these squares.

11 / 38

Variance (cont.)

Population Variance: $S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$

The variance is therefore the average area of all these squares, here represented by the orange square.

12 / 38

Population versus Sample Variance

Typically we want the sample variance. The difference is that we divide by $n - 1$ instead of $N$ to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by $N$.

Population Variance (yellow): $S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$

Sample Variance (green): $s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$
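
The two formulas can be checked by hand in R. The sketch below uses a hypothetical five-value dataset (not the values plotted on the slides) and compares both denominators with base R's `var()`, which computes the sample variance.

```r
# Hypothetical five-value dataset (not the values plotted on the slides).
x <- c(2, 4, 4, 5, 10)
n <- length(x)

sq_dev <- (x - mean(x))^2            # area of each square: squared deviation from the mean

pop_var    <- sum(sq_dev) / n        # population variance: divide by N     (7.2 here)
sample_var <- sum(sq_dev) / (n - 1)  # sample variance: divide by n - 1     (9 here)

pop_var
sample_var
var(x)                               # base R's var() uses the n - 1 denominator
```

With only five values the gap between the two results is noticeable; it shrinks as the number of observations grows.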

13 / 38

Robust Statistics

Consider the following data randomly selected from the normal distribution:

set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)

## [1] 103.1934

## [1] 16.8945

median(x); IQR(x)

## [1] 103.9947

## [1] 25.68004

14 / 38

Robust Statistics

15 / 38

Robust Statistics

Let's add an extreme value:

x <- c(x, 1000)
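
One way to see the effect of the extreme value is to compute the same four statistics before and after adding it. This is a minimal sketch that re-creates the sample with the seed used earlier; the name `x_out` is introduced here only for illustration.

```r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)   # same draw as on the earlier slide
x_out <- c(x, 1000)                   # add one extreme value

# Compare each statistic before and after adding the outlier.
rbind(
  original     = c(mean = mean(x),     sd = sd(x),
                   median = median(x),     IQR = IQR(x)),
  with_outlier = c(mean = mean(x_out), sd = sd(x_out),
                   median = median(x_out), IQR = IQR(x_out))
)
```

The mean and standard deviation should shift substantially, while the median and IQR should stay close to their original values.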

16 / 38

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

- for skewed distributions it is often more helpful to use the median and IQR to describe the center and spread
- for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

17 / 38

About `legosets`

To install the brickset package:

remotes::install_github('jbryer/brickset')

To load the legosets dataset:

data('legosets', package = 'brickset')

The legosets data has 16355 observations of 34 variables.

names(legosets)

##  [1] "setID"                 "name"                  "year"                 
##  [4] "theme"                 "themeGroup"            "subtheme"             
##  [7] "category"              "released"              "pieces"               
## [10] "minifigs"              "bricksetURL"           "rating"               
## [13] "reviewCount"           "packagingType"         "availability"         
## [16] "agerange_min"          "US_retailPrice"        "US_dateFirstAvailable"
## [19] "US_dateLastAvailable"  "UK_retailPrice"        "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable"  "CA_retailPrice"        "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable"  "DE_retailPrice"        "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable"  "height"                "width"                
## [31] "depth"                 "weight"                "thumbnailURL"         
## [34] "imageURL"

18 / 38

Structure (`str`)

str(legosets)

## 'data.frame':    16355 obs. of  34 variables:
##  $ setID                : int  7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
##  $ name                 : chr  "Small house set" "Medium house set" "Medium house set" "Large house set" ...
##  $ year                 : int  1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ theme                : chr  "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
##  $ themeGroup           : chr  "Vintage" "Vintage" "Vintage" "Vintage" ...
##  $ subtheme             : chr  NA NA NA NA ...
##  $ category             : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ released             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ pieces               : int  67 109 158 233 NA 1 1 60 65 NA ...
##  $ minifigs             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ bricksetURL          : chr  "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
##  $ rating               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviewCount          : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ packagingType        : chr  "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ availability         : chr  "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ agerange_min         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ US_retailPrice       : num  NA NA NA NA NA 1.99 NA NA 4.99 NA ...
##  $ US_dateFirstAvailable: Date, format: NA NA ...
##  $ US_dateLastAvailable : Date, format: NA NA ...
##  $ UK_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UK_dateFirstAvailable: Date, format: NA NA ...
##  $ UK_dateLastAvailable : Date, format: NA NA ...
##  $ CA_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CA_dateFirstAvailable: Date, format: NA NA ...
##  $ CA_dateLastAvailable : Date, format: NA NA ...
##  $ DE_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DE_dateFirstAvailable: Date, format: NA NA ...
##  $ DE_dateLastAvailable : Date, format: NA NA ...
##  $ height               : num  NA NA NA NA NA ...
##  $ width                : num  NA NA NA NA NA ...
##  $ depth                : num  NA NA NA NA NA NA NA NA 5.08 NA ...
##  $ weight               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ thumbnailURL         : chr  "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
##  $ imageURL             : chr  "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
19 / 38

RStudio Environment tab can help

20 / 38

Table View

	setID	name	year	theme	themeGroup	category	US_retailPrice	pieces	minifigs	rating
1	29512	Mummy Queen	2019	Collectable Minifigures	Miscellaneous	Normal	3.99	6	1	3.7
2	7650	Von Nebula	2010	HERO Factory	Constraction	Normal	19.99	156		4.3
3	9937	Ninjago Kai ZX Kids' Watch	2012	Gear	Miscellaneous	Gear	24.99			0
4	6565	Aeroplane	2004	Creator	Model making	Normal				0
5	28932	Princess Leia Key Chain	2019	Gear	Miscellaneous	Gear	4.99			0
6	24433	Dressing table	2015	Friends	Girls	Other		22		0
7	2741	Crane and Digger Accessories	1998	Service Packs	Miscellaneous	Normal	4	14		0
8	23821	Azari and the Magical Bakery	2015	Elves	Action/Adventure	Normal	29.99	324	2	3.8
9	22536	Small Freestyle Bucket	1996	Freestyle	Basic	Normal				0
10	29301	Lady Liberty	2019	BrickHeadz	Licensed	Normal	9.99	153		4

21 / 38

Data Wrangling Cheat Sheet

22 / 38

Tidyverse vs Base R

23 / 38

Pipes `%>%` and `|>`

The pipe operator (%>%), introduced with the magrittr R package, allows for the chaining of R operations. Base R added its own native pipe operator (|>) in version 4.1. Both take the output of the left-hand side and pass it as the first argument to the function on the right-hand side.

You can do this in two steps:

tab_out <- table(legosets$category)
prop.table(tab_out)

Or as nested function calls.

prop.table(table(legosets$category))

Using the pipe (|>) operator, we can chain these calls in what is arguably a more readable format:

table(legosets$category) |> prop.table()

## 
##        Book  Collection    Extended        Gear      Normal       Other 
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749 
##      Random 
## 0.002873739
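
The same comparison can be tried without installing the brickset package. The sketch below swaps in the built-in `mtcars` data (an assumption made only so the example is self-contained) and requires R 4.1 or later for the native pipe.

```r
# Uses the built-in mtcars data so the example runs without the brickset package.
# The native pipe |> requires R >= 4.1.
nested <- prop.table(table(mtcars$cyl))
piped  <- table(mtcars$cyl) |> prop.table()

identical(nested, piped)   # both forms evaluate the same call

# Additional arguments go after the piped value: x |> f(y) is f(x, y).
piped |> round(digits = 2)
```

The piped form reads left to right in the order the operations happen, which is the main readability argument for it.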

24 / 38

Filter

25 / 38

Logical Operators

- `!a`: TRUE if `a` is FALSE
- `a == b`: TRUE if `a` and `b` are equal
- `a != b`: TRUE if `a` and `b` are not equal
- `a > b`: TRUE if `a` is larger than `b`, but not equal
- `a >= b`: TRUE if `a` is larger than or equal to `b`
- `a < b`: TRUE if `a` is smaller than `b`, but not equal
- `a <= b`: TRUE if `a` is smaller than or equal to `b`
- `a %in% b`: TRUE if `a` is in `b`, where `b` is a vector

which( letters %in% c('a','e','i','o','u') )

## [1]  1  5  9 15 21

- `a | b`: TRUE if `a` or `b` is TRUE
- `a & b`: TRUE if both `a` and `b` are TRUE
- `isTRUE(a)`: TRUE if `a` is a single TRUE value
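
A few of these operators behave differently when missing values are involved. The sketch below uses small hypothetical vectors to show the element-wise operators, `%in%`, and `isTRUE()` side by side.

```r
# Hypothetical logical vectors, including a missing value.
a <- c(TRUE, FALSE, NA)
b <- c(TRUE, TRUE, FALSE)

a & b      # element-wise AND: NA & FALSE is FALSE
a | b      # element-wise OR:  NA | FALSE is NA
!a         # negation; NA stays NA

# == propagates NA, while %in% always returns TRUE or FALSE.
c(1, NA, 3) == 1
c(1, NA, 3) %in% 1

# isTRUE() is TRUE only for a single, non-missing TRUE.
isTRUE(TRUE)
isTRUE(c(TRUE, TRUE))
isTRUE(NA)
```

Because `==` can return `NA`, conditions built from it may need `which()` or `is.na()` checks when they are used to subset rows.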

26 / 38

Filter

`dplyr`

mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)

Base R

mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]

nrow(mylego)

## [1] 61
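
The `dplyr` and base R versions can differ when the filter condition evaluates to `NA`: `filter()` drops those rows, whereas bracket subsetting returns a row of `NA`s for each one. The sketch below illustrates this on a hypothetical toy data frame whose column names mirror `legosets` but whose values are made up.

```r
# Hypothetical toy data; column names mirror legosets, values are made up.
toy <- data.frame(
  themeGroup = c("Educational", NA, "Educational", "Licensed"),
  year       = c(2018, 2019, 2010, 2020)
)

keep <- toy$themeGroup == "Educational" & toy$year > 2015
keep                        # TRUE NA FALSE FALSE

nrow(toy[keep, ])           # 2: the NA in `keep` yields an extra all-NA row
nrow(toy[which(keep), ])    # 1: which() drops NA, matching dplyr::filter()
nrow(subset(toy, themeGroup == "Educational" & year > 2015))  # 1: subset() also drops NA
```

Wrapping the condition in `which()` or using `subset()` is one way to make a base R filter behave like the `dplyr` one when missing values are present.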

27 / 38

Select

`dplyr`

mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)

Base R

mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]

head(mylego, n = 4)

##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}             NA        6
## 2 26689    142 Education {Not specified}             NA        4
## 3 26804     98 Education {Not specified}             NA        6
## 4 26277    188 Education     Educational          78.95       NA

28 / 38

Relocate

29 / 38

Relocate

`dplyr`

mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)

##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6

Base R

mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)

##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
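
The base R version above lists the column names by hand. A possible alternative, sketched here on a hypothetical toy data frame, selects the columns by type so that the reordering does not depend on the names; the helper vectors `is_chr` and `is_num` are introduced only for this example.

```r
# Hypothetical toy data with the same column order as mylego (values are made up).
toy <- data.frame(
  setID = c(1L, 2L),
  pieces = c(103L, 142L),
  theme = c("Education", "Education"),
  availability = c("Retail", "Retail"),
  US_retailPrice = c(9.99, 19.99),
  stringsAsFactors = FALSE
)

is_chr <- vapply(toy, is.character, logical(1))
is_num <- vapply(toy, is.numeric, logical(1))

# Character columns first, then numeric columns, each group in its original order.
toy2 <- toy[, c(names(toy)[is_chr], names(toy)[is_num])]
names(toy2)
```

Note that this sketch would drop any column that is neither character nor numeric (for example a Date column), so it is not a general replacement for `relocate()`.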

30 / 38

Rename

31 / 38

Rename

`dplyr`

mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)

##   setID pieces     theme    availability USD minifigs
## 1 26803    103 Education {Not specified}  NA        6
## 2 26689    142 Education {Not specified}  NA        4
## 3 26804     98 Education {Not specified}  NA        6

Base R

names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)

##       theme    availability setID pieces USD minifigs
## 1 Education {Not specified} 26803    103  NA        6
## 2 Education {Not specified} 26689    142  NA        4
## 3 Education {Not specified} 26804     98  NA        6
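
Renaming by position, as in `names(mylego2)[5]`, breaks silently if the column order changes. A minimal sketch of renaming by name instead, using a hypothetical two-column data frame:

```r
# Hypothetical toy data; only the column name US_retailPrice comes from legosets.
toy <- data.frame(setID = 1:2, US_retailPrice = c(9.99, 19.99))

# Match on the current name rather than on the column position.
names(toy)[names(toy) == "US_retailPrice"] <- "USD"
names(toy)
```

If no column has the old name, the assignment changes nothing, which is usually safer than overwriting whichever column happens to sit in a given position.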

32 / 38

Mutate

33 / 38

Mutate

`dplyr`

mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>% 
    mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3)

##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000

Base R

# Keep rows where both pieces and US_retailPrice are present
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
# Price per piece = retail price divided by number of pieces
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
head(mylego2, n = 3)
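
The base R steps can be tried on a small made-up data frame. This is a sketch under the assumption that the goal is the same as in the `dplyr` version: drop rows with a missing `pieces` or `US_retailPrice`, then divide price by pieces. The names `toy`, `complete`, `toy2`, and `toy3` are hypothetical.

```r
# Hypothetical toy data; column names mirror legosets, rows are made up.
toy <- data.frame(
  pieces         = c(188, 280, NA, 1),
  US_retailPrice = c(78.95, 224.95, 14.95, NA)
)

# Keep rows where both columns are present.
complete <- !is.na(toy$pieces) & !is.na(toy$US_retailPrice)
toy2 <- toy[complete, ]

# Add the derived column: price divided by number of pieces.
toy2$Price_per_piece <- toy2$US_retailPrice / toy2$pieces
toy2

# transform() creates the same column in a single call.
toy3 <- transform(toy[complete, ], Price_per_piece = US_retailPrice / pieces)
```

The filter has to test columns that already exist; testing a column that has not been created yet returns a zero-length condition and therefore zero rows.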

34 / 38

Group By and Summarize

legosets %>%
    group_by(themeGroup) %>%
    summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
              sd_price = sd(US_retailPrice, na.rm = TRUE),
              median_price = median(US_retailPrice, na.rm = TRUE),
              n = n(),
              missing = sum(is.na(US_retailPrice)))

## # A tibble: 15 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure      31.3     29.9         20.0   1280     462
##  2 Basic                 13.1     12.8          7.99   843     473
##  3 Constraction          15.1     14.0          9.99   501     125
##  4 Educational           89.0    107.          59.7    452     294
##  5 Girls                 23.4     22.6         15.0    677     225
##  6 Historical            25.5     27.7         15.0    473     125
##  7 Junior                18.6     13.2         17.8    228      93
##  8 Licensed              42.9     58.3         25.0   2060     467
##  9 Miscellaneous         14.3     20.8          6.99  4925    2117
## 10 Model making          52.8     65.1         30.0    582     166
## 11 Modern day            31.2     33.7         20.0   1723     763
## 12 Pre-school            23.8     19.4         20.0   1487     699
## 13 Racing                24.8     30.2         10      270      59
## 14 Technical             60.8     68.1         40.0    550     137
## 15 Vintage                9.71     9.56         7.50   304     264
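
For comparison with the other slides, a grouped summary can also be sketched in base R. The example below uses `split()` and `sapply()` on a hypothetical toy data frame; it is one of several possible base R approaches and is not taken from the slides.

```r
# Hypothetical toy data; column names mirror legosets, values are made up.
toy <- data.frame(
  themeGroup     = c("Basic", "Basic", "Basic", "Racing", "Racing"),
  US_retailPrice = c(5, 15, NA, 10, 30)
)

# One vector of prices per group.
groups <- split(toy$US_retailPrice, toy$themeGroup)

summary_tab <- data.frame(
  themeGroup   = names(groups),
  mean_price   = sapply(groups, mean, na.rm = TRUE),
  median_price = sapply(groups, median, na.rm = TRUE),
  n            = sapply(groups, length),                       # counts rows, including NA
  missing      = sapply(groups, function(p) sum(is.na(p))),
  row.names    = NULL
)
summary_tab
```

As in the `dplyr` version, `n` counts every row in the group, so `n - missing` gives the number of prices that actually entered the mean and median.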
35 / 38

Describe and Describe By

library(psych)
describe(legosets$US_retailPrice)

##    vars    n  mean sd median trimmed   mad min    max  range skew kurtosis   se
## X1    1 9886 28.52 42  14.99   20.14 14.83   0 799.99 799.99 5.62    58.91 0.42

describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)

##      item                group1 vars    n      mean        sd   min    max  range         se
## X11     1       {Not specified}    1 3197  24.24484 36.282072  0.60 789.99 789.39  0.6416833
## X12     2           Educational    1    9 140.95000 86.358265 14.95 244.95 230.00 28.7860885
## X13     3        LEGO exclusive    1 1066  28.79797 70.954538  0.00 799.99 799.99  2.1732094
## X14     4    LEGOLAND exclusive    1    7  12.70429  6.447591  4.99  19.99  15.00  2.4369603
## X15     5              Not sold    1    1  12.99000        NA 12.99  12.99   0.00         NA
## X16     6           Promotional    1  167   9.19485 23.667555  0.00 249.99 249.99  1.8314504
## X17     7 Promotional (Airline)    1   11  15.79455  6.614819  5.00  28.00  23.00  1.9944429
## X18     8                Retail    1 4824  29.82030 33.270049  1.95 399.99 398.04  0.4790158
## X19     9      Retail - limited    1  600  44.64837 57.391438  0.40 379.99 379.59  2.3429956
## X110   10               Unknown    1    4   2.24750  1.253671  1.00   3.99   2.99  0.6268356

36 / 38

Additional Resources

For data wrangling:

dplyr website: https://dplyr.tidyverse.org
R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

37 / 38

One Minute Paper

Complete the one minute paper: https://forms.gle/p9xcKcTbGiyYSz368

What was the most important thing you learned during this class?
What important question remains unanswered for you?

38 / 38
